STATEMENT OF TOM VINK IN SUPPORT OF CORRECTION OF SEQUENCE LISTING

I, Tom Vink, Associate Director, Cell & Molecular Science at Genmab B.V., do hereby
declare and state:

1. Since 1 July 2002, I have been an employee of Genmab B.V., which is an affiliate of
Genmab A/S. 

2. I have thorough experience within the fields of molecular biology, protein 
biochemistry and antibody engineering. 

3. The subject application is the U.S. national application equivalent to PCT
application No. WO 2004/035607. I have studied the specification and the sequence listing 
of the PCT application prior to making this statement. 

4. The subject application relates to human monoclonal antibodies against CD20,
and it exemplifies three antibodies, 2F2, 7D8 and 11B8. The application contains claims
relating to human monoclonal anti-CD20 antibodies, wherein the antibodies are
characterized by the variable heavy chain (VH) and variable light chain (VL) amino acid
sequences (SEQ ID Nos: 2, 6, 10, 4, 8 and 12), cf. claims 15, 17 and 19 of the application:

                   2F2                    7D8                    11B8

VH nucleotide      SEQ ID NO: 1           SEQ ID NO: 5           SEQ ID NO: 9

VL nucleotide      SEQ ID NO: 3           SEQ ID NO: 7           SEQ ID NO: 11

VH amino acid      SEQ ID NO: 2           SEQ ID NO: 6           SEQ ID NO: 10

VL amino acid      SEQ ID NO: 4           SEQ ID NO: 8           SEQ ID NO: 12
                   (identical to          (identical to
                   SEQ ID NO: 8)          SEQ ID NO: 4)



5. I have been asked whether it would have been obvious for a person skilled in
the art at the earliest priority date on 17 October 2002 (i) that the leader sequences
erroneously are included in the variable heavy chain and light chain amino acid sequences
(SEQ ID Nos: 2, 6, 10, 4, 8 and 12) which define the mature antibodies 2F2, 7D8 and 11B8,
respectively; and (ii) in the affirmative, how the variable heavy chain and light chain amino
acid sequences without leader sequences should read.

6. For the reasons explained in the following, I believe it would have been
obvious for a person skilled in the art at the earliest priority date on 17 October
2002 (i) that the leader sequences (marked in red in Exhibit A attached hereto) are included
in the variable heavy chain and light chain amino acid sequences; and (ii) how the variable
heavy chain and light chain amino acid sequences without leader sequences should read.

7. It is well known that antibodies are secretory proteins, i.e., proteins that are
transported to the extracellular medium by passing through the intracellular secretory
pathway of the cell, the first compartment of which is the endoplasmic reticulum. Like other
secretory proteins, an antibody heavy or light chain protein is initially produced as a
precursor polypeptide containing a so-called signal or leader peptide (also denoted leader
sequence), which is necessary to direct the polypeptide into the endoplasmic reticulum for
further transport and secretion. As for other soluble secretory proteins, during the transport
into the endoplasmic reticulum, the signal peptides of the antibody heavy and light chains
are cleaved off, resulting in a final mature protein (see also page 29, lines 35-36 of the
subject application). In conclusion, any heavy or light chain from an antibody produced by,
e.g., a hybridoma will be derived from DNA and RNA constructs in which the heavy and light
chain encoding sequences are immediately preceded by the leader sequences.

8. To determine the cleavage site between the leader sequence and the antibody
sequence so as to determine where the leader sequence stops and where the antibody
sequence starts, a comparison can be made with known antibody sequences. Comparison
of the protein sequence (precursor polypeptide) containing the leader sequence, as deduced
from the cloned RNA sequence, with a database containing known human antibody protein
sequences (such as the Vbase as described on page 71, lines 7-8 of the subject application)
will reveal which part of the protein is the signal peptide and where the mature protein
starts. At the earliest priority date of the subject application, several of these databases
were available, including the above Vbase as well as the Kabat [1] or IMGT [2] databases
(see Exhibit C attached hereto for full citations of [1] and [2]).

9. More particularly, these databases contain collections of all germline signal
peptides of all human antibodies, and a simple alignment of the derived amino acid
sequences (VH/VL sequences plus leader sequences) with these signal peptide databases
would reveal which part of the sequences are the signal peptides and which part of the
sequences are the VH/VL sequences. Moreover, comparison/alignment of the derived amino
acid sequences (VH/VL sequences plus leader sequences) with the collection of mature
human VH and VL sequences in these databases would reveal where the mature proteins
start.

As an example, in Figure 1 in Exhibit B attached hereto, a selection of VH
germline leader peptides is shown (screenshot from the Vbase). The leader peptide of VH
7D8, MELGLSWIFLLAILKGVQC, is easily identified as sequence VH3 3-09.
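
The comparison with a known germline leader can be illustrated with a minimal Python sketch. It assumes the 19-residue leader quoted above and uses only the first 32 residues of the 7D8 VH precursor (SEQ ID NO: 6) in one-letter code. The function name `strip_leader` is hypothetical, and the snippet is an illustration of the reasoning, not the Vbase tool itself.

```python
# Leader peptide of VH 7D8 as quoted in this statement (germline VH3 3-09).
LEADER_VH3_3_09 = "MELGLSWIFLLAILKGVQC"

# First 32 residues of the 7D8 VH precursor (SEQ ID NO: 6), one-letter code.
# This is an illustrative fragment, not the full 141-residue sequence.
PRECURSOR_7D8_VH_START = "MELGLSWIFLLAILKGVQCEVQLVESGGGLVQ"


def strip_leader(precursor: str, leader: str) -> str:
    """Return the mature part if the precursor begins with the given leader."""
    if not precursor.startswith(leader):
        raise ValueError("precursor does not start with the given leader")
    return precursor[len(leader):]


mature_start = strip_leader(PRECURSOR_7D8_VH_START, LEADER_VH3_3_09)
print(len(LEADER_VH3_3_09))  # number of leader residues removed
print(mature_start[:2])      # first two residues of the mature VH
```

Under these assumptions the leader is 19 residues long, so a 141-residue precursor would leave 122 residues for the mature VH region.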

Alternatively, using the DNAplot module on the Vbase site (which, to my
knowledge, was available prior to the earliest priority date of 17 October 2002), it is
possible to align the nucleotide sequence of a rearranged V gene to its closest mature
germline V. In Figure 2 in Exhibit B this was done for the VL region of 11B8 and, as is shown
in the screenshot, the start of the mature VL encoding sequence is gaaatt, confirming the
start of the mature VL polypeptide sequence at the corresponding amino acids EI. This can
also be done with the V-QUEST module at the IMGT site, see Figure 3 in Exhibit B.
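
The step from the nucleotides gaaatt to the first two amino acids of the mature VL region can be checked with a short Python sketch. The codon table below is deliberately reduced to the two codons needed here (gaa = Glu, att = Ile in the standard genetic code); the function name `translate` is hypothetical and this is not the DNAplot or V-QUEST software.

```python
# Reduced codon table: only the two codons needed for this example,
# taken from the standard genetic code.
CODON_TABLE = {"gaa": "E", "att": "I"}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon using CODON_TABLE."""
    dna = dna.lower()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Start of the mature VL encoding sequence of 11B8 as quoted in the text.
mature_vl_start = translate("gaaatt")
print(mature_vl_start)
```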

10. The leader sequence for a VH sequence is typically 19-21 amino acids long,
and the leader sequence for a VL kappa sequence is typically 19-23 amino acids long. A
protein always starts with a Met residue, so the leader sequences always start with a Met
residue. A VH sequence may start with a Glu or Gln residue; a VL kappa sequence may start
with Ala, Asp, Val, Asn or Glu. VH signal peptides can end with Cys, Ala, Ser or Pro, and VL
kappa signal peptides can end with Cys, Ala, Gly or Glu. Accordingly, a person skilled in the
art would know from studying the nucleotide and amino acid sequences (SEQ ID Nos: 1-12)
that the leader sequences are indeed included. Also, a person skilled in the art would note
that the amino acid sequences are longer than usual for the mature amino acid sequences.
This information was known to a person skilled in the art at the earliest priority date of the
subject application and further supports that the leader sequences are indeed included in
the variable heavy chain and light chain amino acid sequences as identified by SEQ ID Nos:
2, 6, 10, 4, 8 and 12, respectively.
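
The rules of thumb in this paragraph can be written down as a small Python sketch. The dictionary layout, the key names and the function `plausible_leader_lengths` are illustrative assumptions; the residue sets and length ranges are those stated above, with the heavy-chain rule keyed as "VH". The example input is the first 32 residues of the 7D8 VH precursor (SEQ ID NO: 6).

```python
# Rules of thumb from paragraph 10, encoded as simple lookup tables.
# The layout and names are illustrative assumptions.
RULES = {
    "VH": {
        "leader_lengths": range(19, 22),   # 19-21 residues
        "leader_ends": set("CASP"),        # Cys, Ala, Ser or Pro
        "mature_starts": set("EQ"),        # Glu or Gln
    },
    "VL_kappa": {
        "leader_lengths": range(19, 24),   # 19-23 residues
        "leader_ends": set("CAGE"),        # Cys, Ala, Gly or Glu
        "mature_starts": set("ADVNE"),     # Ala, Asp, Val, Asn or Glu
    },
}


def plausible_leader_lengths(precursor: str, chain: str) -> list[int]:
    """List the leader lengths for which the precursor satisfies every rule."""
    rules = RULES[chain]
    if not precursor.startswith("M"):      # a leader always starts with Met
        return []
    return [
        n for n in rules["leader_lengths"]
        if len(precursor) > n
        and precursor[n - 1] in rules["leader_ends"]
        and precursor[n] in rules["mature_starts"]
    ]


# First 32 residues of the 7D8 VH precursor (SEQ ID NO: 6).
candidates = plausible_leader_lengths("MELGLSWIFLLAILKGVQCEVQLVESGGGLVQ", "VH")
print(candidates)
```

In this sketch only a 19-residue leader satisfies all of the stated rules for the 7D8 VH fragment, which agrees with the cleavage site between C and E discussed in this statement.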

11. Alternatively, at the earliest priority date of the subject application several
predictive methods were described, such as [3-5] (see Exhibit C attached hereto for full
citations of [3-5]), by which signal peptide sequences could be predicted from a protein
sequence. Using the SignalP server (which was available prior to the earliest priority date of
17 October 2002), we have performed this for the VH polypeptide of antibody 7D8 and a
screenshot of the result is shown in Figure 4 in Exhibit B attached hereto, indicating the
signal peptide cleavage site between the C and E amino acids, confirming the start of the
mature VH region as EV.

12. In conclusion, it is clear that the claims defining the mature antibodies by the 
variable heavy chain and light chain amino acid sequences by mistake contain the leader 
sequences and that nothing else was intended than to define the mature antibodies without 
these leader sequences.



13. Monoclonal antibodies 2F2, 7D8 and 11B8 are characterized by having the
following VH CDR1 regions in the application, cf. for example claims 33, 36 and 40 of the
application:





             2F2                7D8                11B8

VH CDR1      SEQ ID NO: 13      SEQ ID NO: 19      SEQ ID NO: 25



The CDR regions (SEQ ID Nos: 13-18, 19-24, 25-30) are highlighted in the variable heavy 
and light chain sequences in Exhibit B attached hereto. 

14. I have been asked whether it would have been obvious for a person skilled in
the art at the earliest priority date on 17 October 2002 (i) that the VH CDR1 regions
erroneously contain an additional amino acid; and (ii) in the affirmative, how the correct VH
CDR1 regions should read.

15. I believe that it would have been obvious for a person skilled in the art at the
earliest priority date on 17 October 2002 (i) that the VH CDR1 regions contain an additional
amino acid; and (ii) how the correct CDR1 regions should read, for the below reasons.

16. At the earliest priority date of the subject application, several different
methods were available to determine the CDR regions of antibodies. The most common
method was the Kabat method [1] (see also [6] pages 432-433 for a comprehensive
manual for assigning the CDR regions of an antibody using the Kabat numbering scheme;
see Exhibit C attached hereto for a full citation of [6]). Analysis of SEQ ID Nos. 14, 15, 16,
17, and 18 (2F2), 20, 21, 22, 23, and 24 (7D8) and 26, 27, 28, 29, 30 (11B8) defining the
VH CDR2 and CDR3 regions and the VL CDR1, CDR2 and CDR3 regions of 2F2, 7D8 and
11B8, respectively, shows that the Kabat numbering scheme has indeed been used in this
application.

17. When applying the Kabat numbering rules to the VH CDR1, cf. [6] page 432:
"Start: Approximately residue 21 (always 9 after a C (Cys))

Residues before: always CXXXXXXXX (X meaning any amino acid)
Residue after: Always W. Typically WV, but also WI, WA
Length: 5-7 residues"

It is obvious when applying these rules to the VH CDR1 regions that the
respective VH CDR1 regions should be "Asp Tyr Ala Met His" for SEQ ID No 13, "Asp Tyr Ala
Met His" for SEQ ID No 19 and "Tyr His Ala Met His" for SEQ ID No 25, respectively. When
comparing these to the VH CDR1 sequences as defined in the application, it appears that, by
mistake, an extra amino acid has been added to these VH CDR1 regions.
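
The quoted rules can be expressed as a short Python sketch using a regular expression: a Cys, eight further residues, then a CDR1 of 5-7 residues that is followed by Trp. The function name `kabat_vh_cdr1` and the choice of a regular expression are assumptions made for illustration; the input fragments are short stretches around CDR1 taken from SEQ ID NOs: 2, 6 and 10 in one-letter code.

```python
import re

# Quoted rules: a Cys, eight further residues (CXXXXXXXX), then a CDR1 of
# 5-7 residues that is always followed by Trp. The lazy quantifier picks
# the shortest CDR1 that is directly followed by W.
CDR_H1_RULE = re.compile(r"C.{8}(.{5,7}?)W")


def kabat_vh_cdr1(vh_fragment: str) -> str:
    """Return the VH CDR1 found by the quoted rules, in one-letter code."""
    match = CDR_H1_RULE.search(vh_fragment)
    if match is None:
        raise ValueError("no segment matches the CDR1 rules")
    return match.group(1)


# Fragments around CDR1 taken from SEQ ID NOs: 2, 6 and 10.
FRAGMENTS = {
    "2F2": "LRLSCAASGFTFNDYAMHWVRQAPGKGL",
    "7D8": "LRLSCAASGFTFHDYAMHWVRQAPGKGL",
    "11B8": "LRLSCTGSGFTFSYHAMHWVRQAPGKGL",
}

cdr1 = {name: kabat_vh_cdr1(seq) for name, seq in FRAGMENTS.items()}
print(cdr1)
```

In each fragment the residue immediately before the matched segment (N, H and S, respectively) is the ninth residue counted from the Cys, so under the quoted rules it falls outside CDR1.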

18. In conclusion, in view of the Kabat rules consistently applied in the application
to designate the CDR regions, it is clear that the addition of a further amino acid to the VH
CDR1 sequences was a mistake and that nothing else was intended than to define the VH
CDR1 sequences without inclusion of this additional amino acid.

19. I declare that all statements made in this Declaration of my own knowledge
are true and that all statements made on information and belief are believed to be true.
Moreover, these statements are made with the knowledge that willful false statements and
the like made by me are punishable by fine or imprisonment, or both, under § 1001 of Title
18 of the United States Code and that such willful false statements may jeopardize the
validity of the application or any patent issued thereon.




Attachments: 

Exhibit A: Marked-up Sequence Listing (SEQ ID NOs: 1-12), color annotated
Exhibit B: Marked-up Sequence Listing (SEQ ID NOs: 2, 6, and 10), color annotated,
and Figures 1-4
Exhibit C: List of References cited in the above Declaration

Exhibit A



<210> 1 VH 2F2

<211> 424 

<212> DNA

<213> Homo sapiens 

SL 5 i^.lsi?\A^^ 60 

gtgcagc;:gg tggagc.ctgg gggaggcttg gtacagcctg gcagqtccct gagactctcc 120 
1 j i i h i ' < 1 3 c tc c actgggtccg gcaagctcca 180 

gggaagggi 

gactctgtga agggccgatt caccatctcc agagacaacg ccaagaagtc cccgtatctg 300 

caaatgaa jtcf i tgaggaca zctt tact a i aaa a tacag 360 

tacggcaact actactacgg tatggacgtc tggggccaag ggaccacggt caccgt.ctcc 420 
tcag 424 



<210> 2 VH 2F2
<211> 141
<212> PRT

<213> Homo sapiens
<400> 2

(??? = residue illegible in this copy)

Met Glu ??? Gly ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ???
1               5                   10                  15

Val Gln Cys Glu Val Gln Leu Val Glu Ser Gly Gly Gly Leu Val Gln
            20                  25                  30

Pro Gly Arg Ser Leu Arg Leu Ser Cys Ala Ala Ser Gly Phe Thr Phe
        35                  40                  45

Asn Asp Tyr Ala Met His Trp Val Arg Gln Ala Pro Gly Lys Gly Leu
    50                  55                  60

Glu Trp Val Ser Thr Ile Ser Trp Asn Ser Gly Ser Ile Gly Tyr Ala
65                  70                  75                  80

Asp Ser Val Lys Gly Arg Phe Thr Ile Ser Arg Asp Asn Ala Lys Lys
                85                  90                  95

Ser Leu Tyr Leu Gln Met Asn Ser Leu Arg Ala Glu Asp Thr Ala Leu
            100                 105                 110

Tyr Tyr Cys Ala Lys Asp Ile Gln Tyr Gly Asn Tyr Tyr Tyr Gly Met
        115                 120                 125

Asp Val Trp Gly Gln Gly Thr Thr Val Thr Val Ser Ser
    130                 135                 140









<210> 3 VL 2F2
<211> 382 
<212> DNA 

<213> Homo sapiens 
<400> 3 

gaaattgtgt tgacacagtc v. ccagccacc ctgtctt: tg!; ct:ccagggga aagagccacc 120 
ctctcctgca gggccagtca gagtgttagc agctacttag cctggtacca acagaaacct 180
ggccaggctc ccaggctcct catctatgat gcatccaaca gggccactgg catcccagcc 240

aggttcagtg gcagtgggtc tgggacagac ttcactctca ccatcagcag cctagagcct 300

gaagattttg cagtttatta ctgtcagcag cgtagcaact ggccgatcac cttcggccaa 360

gggacacgac tggagattaa ac 382

<210> 4 VL 2F2

<211> 127

<212> PRT

<213> Homo sapiens

<400> 4

(??? = residue illegible in this copy)

??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ???
1               5                   10                  15

??? Thr Thr Gly Glu Ile Val Leu Thr Gln Ser Pro Ala Thr Leu Ser
            20                  25                  30

Leu Ser Pro Gly Glu Arg Ala Thr Leu Ser Cys Arg Ala Ser Gln Ser
        35                  40                  45

Val Ser Ser Tyr Leu Ala Trp Tyr Gln Gln Lys Pro Gly Gln Ala Pro
    50                  55                  60

Arg Leu Leu Ile Tyr Asp Ala Ser Asn Arg Ala Thr Gly Ile Pro Ala
65                  70                  75                  80

Arg Phe Ser Gly Ser Gly Ser Gly Thr Asp Phe Thr Leu Thr Ile Ser
                85                  90                  95

Ser Leu Glu Pro Glu Asp Phe Ala Val Tyr Tyr Cys Gln Gln Arg Ser
            100                 105                 110

Asn Trp Pro Ile Thr Phe Gly Gln Gly Thr Arg Leu Glu Ile Lys
        115                 120                 125









<210> 5 VH 7D8

<211> 424 

<212> DNA 

<213> Homo sapiens 

<400> 5 

AlAAAhlah^ 60 

gtgcagctgg tggaguctgg gggaggcttsg gtacagccag acaqgccaaat qagactctcc 120 

tgtgcagcct ctggattcac ctttcatgat tatgccatgc actgggtccg gcaagctcca 18 0 

gggaagggcc tggagtgggt ctcaactatt agttggaata gtggtaccat aggctatgcg 240 

gactctgtga agggccgatt caccatctcc agagacaacg ccaagaactc cctgtatctg 300 

caaatgaaca gtctgagagc tgaggaaacg gccttgtatt actgtgcaaa agatatacag 36 0 

tacggcaact actactacgg tatggacgtc tggggccaag ggaccacggt caccgtctcc 42 0 

tcag 424 

<210> 6 VH 7D8
<211> 141
<212> PRT

<213> Homo sapiens
<400> 6

(??? = residue illegible in this copy)

Met Glu Leu Gly Leu Ser Trp Ile Phe Leu Leu Ala Ile Leu Lys Gly
1               5                   10                  15

Val Gln Cys Glu Val Gln Leu Val Glu Ser Gly Gly Gly Leu Val Gln
            20                  25                  30

Pro Asp Arg ??? Leu Arg Leu Ser Cys Ala Ala Ser Gly Phe Thr Phe
        35                  40                  45

His Asp Tyr Ala Met His Trp Val Arg Gln Ala Pro Gly Lys Gly Leu
    50                  55                  60

Glu Trp Val Ser Thr Ile Ser Trp Asn Ser Gly Thr Ile Gly Tyr Ala
65                  70                  75                  80

Asp Ser Val Lys Gly Arg Phe Thr Ile Ser Arg Asp Asn Ala Lys Asn
                85                  90                  95

Ser Leu Tyr Leu Gln Met Asn Ser Leu Arg Ala Glu Asp Thr Ala Leu
            100                 105                 110

Tyr Tyr Cys Ala Lys Asp Ile Gln Tyr Gly Asn Tyr Tyr Tyr Gly Met
        115                 120                 125

Asp Val Trp Gly Gln Gly Thr Thr Val Thr Val Ser Ser
    130                 135                 140







<210> 7 VL 7D8

<211> 382

<212> DNA

<213> Homo sapiens

<400> 7 

.^kSS:SAjll;iAL.£i!l;il AlAArlA A : : . . A:. A\ ^ 1^.:^. . . i?^ 1 ; X 1 : l 1 :^.^ I?.?. . ^ ^ ^^HSS^^ 60 

gaaattgtgt tgacacaqtc ;:.ccagccacc ctg cc t: >: tgt etccagggga aagagccacc 120 
ctctcctgca gggccagtca gagtgttagc agctacttag cctggtacca acagaaacct 180 
ggccaggctc ccaggctcct catctatgat gcatccaaca gggccactgg catcccagcc 240
aggttcagtg gcagtgggtc tgggacagac ttcactctca ccatcagcag cctagagcct 300
gaagattttg igtttatt itcagc : it, itcac cttcggccaa, 360 

gggacacgac tggagattaa ac 382

<210> 8 VL 7D8

<211> 127

<212> PRT

<213> Homo sapiens

<400> 8

(??? = residue illegible in this copy)

??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ???
1               5                   10                  15

??? Thr Thr Gly Glu Ile Val Leu Thr Gln Ser Pro Ala Thr Leu Ser
            20                  25                  30

Leu Ser Pro Gly Glu Arg Ala Thr Leu Ser Cys Arg Ala Ser Gln Ser
        35                  40                  45

Val Ser Ser Tyr Leu Ala Trp Tyr Gln Gln Lys Pro Gly Gln Ala Pro
    50                  55                  60

Arg Leu Leu Ile Tyr Asp Ala Ser Asn Arg Ala Thr Gly Ile Pro Ala
65                  70                  75                  80

Arg Phe Ser Gly Ser Gly Ser Gly Thr Asp Phe Thr Leu Thr Ile Ser
                85                  90                  95

Ser Leu Glu Pro Glu Asp Phe Ala Val Tyr Tyr Cys Gln Gln Arg Ser
            100                 105                 110

Asn Trp Pro Ile Thr Phe Gly Gln Gly Thr Arg Leu Glu Ile Lys
        115                 120                 125









<210> 9 VH 11B8

<211> 433

<212> DNA 

<213> Homo sapiens 

<400> 9 

TTGSil\TTTGii^ 50 

gttcagctgq Lgcagtctqg gggaggci;Lg gtaoa ::cctg gggggtccct gagactctcc 120 

tgtacaggct ctggattc? ~ - ^ j i -t t i i |j -v. , i8Q 

ggaaaaggtc tggaatgggt atcaattatt gggactggtg gtgtcacata ctatgcagac 240 

tccgtgaagg gccgattcac catctccaga gacaatgtca agaactcctt gtatcttcaa 300

atgaacagcc tgagagccga ggacatggct gtgtattact gtgcaagaga ttactatggt 360 

gcggggagtt tttatgacgg cctctacggt atggacgtct ggggccaagg gaccacggtc 420

accgtctcct cag 433 

<210> 10 VH 11B8
<211> 144
<212> PRT

<213> Homo sapiens

<400> 10

(??? = residue illegible in this copy)

??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ???
1               5                   10                  15

??? ??? Cys Glu Val Gln Leu Val Gln Ser Gly Gly Gly Leu Val His
            20                  25                  30

Pro Gly Gly Ser Leu Arg Leu Ser Cys Thr Gly Ser Gly Phe Thr Phe
        35                  40                  45

Ser Tyr His Ala Met His Trp Val Arg Gln Ala Pro Gly Lys Gly Leu
    50                  55                  60

Glu Trp Val Ser Ile Ile Gly Thr Gly Gly Val Thr Tyr Tyr Ala Asp
65                  70                  75                  80

Ser Val Lys Gly Arg Phe Thr Ile Ser Arg Asp Asn Val Lys Asn Ser
                85                  90                  95

Leu Tyr Leu Gln Met Asn Ser Leu Arg Ala Glu Asp Met Ala Val Tyr
            100                 105                 110

Tyr Cys Ala Arg Asp Tyr Tyr Gly Ala Gly Ser Phe Tyr Asp Gly Leu
        115                 120                 125

Tyr Gly Met Asp Val Trp Gly Gln Gly Thr Thr Val Thr Val Ser Ser
    130                 135                 140











<210> 11 VL 11B8
<211> 382
<212> DNA

<213> Homo sapiens
<400> 11

'w - it ' icacagl i ccagccacc ctgtctttgt ctccagggga aagagccacc 120 
ctctcctgca gggccagtca gagtgttagc agctacttag cctggtacca acagaaacct 180
ggccaggctc ccaggctcct catctatgat gcatccaaca gggccactgg catcccagcc 240
aggttcagtg gcagtgggtc tgggacagac ttcactctca ccatcagcag cctagagcct 300 
gaagattttg cagtttatta ctgtcagcag cgtagcgact ggccgctcac tttcggcgga 360 
gggaccaa^ g ? i -aa ac 3 82 






<210> 12 VL 11B8

<211> 127

<212> PRT

<213> Homo sapiens

<400> 12

(??? = residue illegible in this copy)

??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ???
1               5                   10                  15

??? ??? ??? ??? ??? ??? ??? ??? Thr Gln Ser Pro Ala Thr Leu Ser
            20                  25                  30

Leu Ser Pro Gly Glu Arg Ala Thr Leu Ser Cys Arg Ala Ser Gln Ser
        35                  40                  45

Val Ser Ser Tyr Leu Ala Trp Tyr Gln Gln Lys Pro Gly Gln Ala Pro
    50                  55                  60

Arg Leu Leu Ile Tyr Asp Ala Ser Asn Arg Ala Thr Gly Ile Pro Ala
65                  70                  75                  80

Arg Phe Ser Gly Ser Gly Ser Gly Thr Asp Phe Thr Leu Thr Ile Ser
                85                  90                  95

Ser Leu Glu Pro Glu Asp Phe Ala Val Tyr Tyr Cys Gln Gln Arg Ser
            100                 105                 110

Asp Trp Pro Leu Thr Phe Gly Gly Gly Thr Lys Val Glu Ile Lys
        115                 120                 125

Exhibit B



<210> 2 VH 2F2 (Additional amino acid in VH CDR1 region has been underlined.)
<211> 141
<212> PRT

<213> Homo sapiens
<400> 2

(??? = residue illegible in this copy)

Met Glu ??? Gly ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ???
1               5                   10                  15

Val Gln Cys Glu Val Gln Leu Val Glu Ser Gly Gly Gly Leu Val Gln
            20                  25                  30

Pro Gly Arg Ser Leu Arg Leu Ser Cys Ala Ala Ser Gly Phe Thr Phe
        35                  40                  45

Asn Asp Tyr Ala Met His Trp Val Arg Gln Ala Pro Gly Lys Gly Leu
    50                  55                  60

Glu Trp Val Ser Thr Ile Ser Trp Asn Ser Gly Ser Ile Gly Tyr Ala
65                  70                  75                  80

Asp Ser Val Lys Gly Arg Phe Thr Ile Ser Arg Asp Asn Ala Lys Lys
                85                  90                  95

Ser Leu Tyr Leu Gln Met Asn Ser Leu
            100                 105

<210> 10 VH 11B8 (Additional amino acid in VH CDR1 region has been

Figure 1

A screenshot is shown of part of the germline VH signal peptides in the Vbase database.

VH Leader - Amino acid sequence alignment

Figure 2

A screenshot of an analysis performed using the DNAplot module on the Vbase site. The
complete DNA sequence of the VL 11B8 antibody was analyzed and an alignment is shown of
this sequence with the germline mature VL regions in the database.

Figure 3

A screenshot of an analysis performed using the V-QUEST module on the IMGT site. The
complete DNA sequence of the VL 11B8 antibody was analyzed and an alignment is shown of
this sequence with the germline mature VL regions in the database.

Figure 4

The VH polypeptide sequence of antibody 7D8 was analyzed using the SignalP server. A
screenshot of the resulting prediction of the signal peptide cleavage site is shown.

ABSTRACT

IMGT, the international ImMunoGeneTics database,
freely available at http://imgt.cines.fr:8104, was
created in 1989 at the Universite Montpellier II, CNRS,
Montpellier, France, and is a high quality integrated
information system specialising in immunoglobulins,
T cell receptors and major histocompatibility
complex molecules of human and other vertebrates.
IMGT provides researchers and clinicians with a
common access to all nucleotide, protein, genetic
and structural immunogenetics data. This information
is of high value for medical and veterinary research,
biotechnology related to antibody and T cell receptor
engineering, genome diversity and evolution studies
of the immune response.

INTRODUCTION 

IMGT, the international ImMunoGeneTics database
(http://imgt.cines.fr:8104) (1,2), created in 1989 at the Universite
Montpellier II, CNRS, Montpellier, France, is a high quality
integrated information system specialising in immunoglobulins
(Ig), T cell receptors (TcR) and major histocompatibility
complex (MHC) molecules of human and other vertebrates.

IMGT pro i 1 1 1 1 t \ it ' i 

the gendrne i t i tructure of the Ig, TcR 

and MHC. Due to its high quality and easy data distribution, 
IMC has sin) t it i t t 1 t 

in autoimmune diseases, AIDS, leukemias, lymphomas, 
r i . t i ,v i ii is t i 1 \ ! t t 

antibody engineering, veterinary research, genome diversity 
and genome evolution studies of the immune response. IMGT 
consists of databases ('IMGT sequence databases'), Web
resou i ( IMC 1 \tni n i . r> five tools 

(Fig. 1). 

IMGT SEQUENCE DATABASES 

The IMGT sequence databases comprise at present (i) IMGT/
LIGM-DB, a comprehensive database of 41 248 Ig and TcR
nucleotide [illegible] other vertebrate
[illegible]
by LIGM (Laboratoire d'ImmunoGenetique Moleculaire,
[illegible]
the 1257 human MHC allele sequences, developed by ICRF
(Imperial Cancer Research Fund, Oxford, UK) and ANRI
(Anthony Nolan Research Institute, London, UK) (4).

IMGT MARIE-PAULE PAGE

The IMGT Marie-Paule page comprises Web resources 
^ i il MI it 1 in !i - t i-a i t G 

externa! hyperlinks). 

The IMGT Scientific chart pjv GJ it > bum 

and the annotation rules and cot ^< i by IMGT i 

H it ! illll I 1 Mill . Ohl id 

data of all vertebrate species (5). The concepts of classification
have been used to set up a unique nomenclature of Ig and TcR
genes (6,7), which has been adopted by the HUGO nomenclature
committee. The complete list of the IMGT human Ig and TcR
gene names has been entered in GDB and LocusLink (1999). A
uniform numbering system for Ig and TcR sequences of all
m ks i Kt. i ^si ih-di t t 

and cross-referencing between experiment* from different 
l.i kn 1M il t , t - 1 .i i „ or f.si v inn 
type or species (ti). IMGT has developed a formal specification 
of the terms to be used in the domain of immunogenetics and 
bioinform s to ensu eci < 1 i l u lL v 

in IMGT, This has been the basis of the 1MGT-ONTOLOGY (5), 
the first ontoh n'i> 1 i r * 
the immunogenetics knowledge lor all vertebrate species, 
nii.ol i ! i l i i 1 1 i t in < 1 1 

and biological data evaluation (9,10). 
The IMGT Repertoire, the global Web resource in ImMuno- 
c t i tit' lobul md T ceil receptors 

human and other vertebrates, based on the 'IMGT Scientific 

1 It 1! 1(1 s t 1 ! 1 s 1 t 1 i (I 

i ne, proteome, pol 1 structure of 

the Ig and TcR (2). Genome data include chromosomal 
localisations, lot > nd gt 1 < 'cut rubles 

Proteome and polymorphism data are represented by protein 
ispl s 1 i i r 1 . ! ! i ! esc data 

are regularly published tn the IMGT Locus in Focus section of 
Ex{»>rimenia! ami Clinical Immunogenetics (IMGT indexMMGT 
cus in Focus at http://jrngt.cin ; 104). Sin lit H data 
comprise I graphical ^ t c t i i i designated as Colliers 
t Pet e i nd D epreseni ms < t feR riablc 
n ons 2,3). Thi lisation permits - eft tio i 

between protein sequences and 3D data retrieved from the 
Protein Data Bank (PDB). A new section, currently being 
developed, contains data cm probes used lor tire analysis of ig 
and TcR gene rear angemenis and expressions md RF1 P 
i i phism) studies 






Exhibit C(2) 



208 Nucleic Acids Research, 2001, Vol. 29, No. 1




The- IMGT Index and the IMGT Aide-memoire represent 
useful k'Mul tM.it i! s k i s i i 11 
IMGT Bloc-notes provides numerous hyperlinks to the Web
servers specialising in immunology, genetics, molecular
biology and bioinformatics (11).

IMGT TOOLS AND OTHER ACCESSES

IMGT/V-QUEST (V-QUEry and STandardization) is an
integrated software for Ig and TcR. This tool analyses the input
1 r 1 \ i i 1 leotia u \ i 

nucK 1 i 1 " parison wi he IMGT refers 

net t! t i t u i m 1 i i i i 
T L 1 1 t i t t ^ C CI i DBcanbe 

searched by BLAST <>r FASTA on different servers. 

Since July 1995, IMGT has been available on the Web at
http://imgt.cines.fr:8104. IMGT provides biologists with an
easy to use and friendly interface. From January 1996 to
October 2000, the IMGT WWW server at Montpellier was

a s b e than < >0 t IMGT has an exceptional 



response with more than 6000 rent s .. \ 1 1 \V,T U>i i 
j'so dsst tbutcd >s tm t stiibution of CD-ROM, network 
fileserver: netserv@ebi.ao.uk, and anonymous FTP server), 
and from many SRS (Sequence Retrieval System) sites. To 
meditate the integration ot IMGT data t applications developed 
by . t ul ii < e i built an API (Application Program- 
ming interface! to access the database and its software tools 
(10), 



ELECTRONIC AND MAILING ADDRESSES 

IMGT home page: http://imgt.cines.fr:8104 (IMGT contact
lefranc@ligm.igh.cnrs.fr).

IMGT page at EBI: http://www.ebi.ac.uk/imgt,
ftp://ftp.ebi.ac.uk/pub/databases/imgt

IMGT/LIGM-DB: http://imgt.cines.fr:8104 (contacts
ftp:// 

i m ( i . < to i it 
IMGT/HLA-DB: http://www.ebi.ac.uk/imgt/hla/ (contacts
j < binso bi.ac i\ til 1 rf.icnet.ul is i@i< icnetul 

[illegible]
should cite this article as a general reference for the access to
and content of IMGT, and quote the IMGT home page URL,
http://imgt.cines.fr:8104.

Abstract

[illegible] and deliver them wherever they are needed: into different internal
compartments called organelles or even out of the cell altogether. One of the most essential features of the ZIP code system is the signal
[illegible] the N-terminal part of the protein and is trimmed away by the time it is secreted.
Owing to the importance of signal peptides for understanding the [illegible] mechanisms of genetic diseases, [illegible] cells for gene
[illegible] method to identify
the signal peptides. In this paper, a scaled window model is proposed. Based on such a model as well as Markov chain theory, a new
[illegible] 1939 secretory proteins and 1440 non-secretory proteins have
indicated that the new algorithm is particularly successful in the overall success rate, and hence can serve as a complementary tool to the
existing algorithms for signal peptide prediction. © 2001 Elsevier Science Inc. All rights reserved.

Keywords: "Zipcode" sequence; Markov chain; Secretory proteins; Non-secretory proteins; Discriminant function

1. Introduction

Proteins with various functions are constantly being
made within cells. These nascent proteins have to be transported
either out of the cell, or to the different organelles
within the cell. How are they transported across the membrane
surrounding the organelles? And how are they directed
to their correct location? These questions have been
answered through the work of Günter Blobel, the Nobel
Laureate of last year in Physiology or Medicine [13]. He
and his co-workers have discovered that newly synthesized
proteins have an intrinsic signal peptide, functioning as
"address tags" or "zip codes" that is essential for directing
them wherever they are needed.

The discovery of signal peptides has had an immense 
impact on modern cell biological research. Knowledge of 

i eptides is helper p i 1 role ui 
t is ts be id sec ?;enc c t ses x * ie cell d i cx 
large amounts of proteins are being made and new or- 
ganelles are formed. If a sorting signal in a protein is 
changed, the protein could end up in a wrong cellular 
location and cause varieties of diseases. For example, in 



some forms of familial hypercholesterolemia, a very high 
level of cholesterol in the blood is due to deficient transport 
signals. Also, hereditary diseases such as cystic fibrosis are
caused by the fact that proteins do not reach their proper
destination. Knowledge of signal peptides will increase our
understanding of processes leading to disease and hence can
be used to develop new therapeutic strategies.

Today some drugs have already been produced in the 
form of proteins, e.g. growth hormone, insulin, and hemo- 
globin. Usually bacteria are used for the production of 
protein drugs. However, in order to be functionally proper, 
it is necessary to synthesize human proteins in more com- 
plex cells, such as yeast cells. The contemporary gene 
technology allows us to generate the genes of the desired 
proteins with sequences coding for transport signals. The 
cells with the modified genes can be efficiently used as
"protein factories." Accordingly, knowledge of protein
signals can then be used to reprogram cells in a specific way
for future cell and gene therapy. Actually, protein signals
have become a crucial tool for [...]
drugs that are targeted to a particular organelle to correct a
specific defect. For example, by adding a specific tag to the
desired proteins, one can, for instance, tag them for
excretion, making them much easier to harvest [13].

Actually, the idea [...] has become
[...] effective [...].

Fig. 1. A schematic drawing to show the signal sequence of a protein and how it is cleaved by the signal peptidase. An amino acid in the signal part is depicted
[...] the mature protein.



However, since the number of protein sequences entering
into data banks has been rapidly increasing, it is
time-consuming and costly to identify signal peptides entirely
by experiments. For example, the yearly increment of
sequence entries in SWISS-PROT [1] in 1987 was 1,266, and
that in 1988 was 3,497, but that in 1997 was already 10,092.
In view of this, it is highly desirable to develop an
automated algorithm to identify signal peptides of newly
synthesized proteins. The existing methods for predicting the
signal peptides are based mostly on the use of neural
networks (see, e.g. [11,16]). As pointed out by King [14], the
advantages of neural network prediction methods are: (1)
"readily available," and (2) "often successful in practice";
the disadvantages are: (1) "very poor explanatory power,"
(2) "little use of chemical or physical theory," and (3)
"statistically rather poorly characterized." Besides, although
the computational costs for training the networks were
considerably higher, the prediction accuracy thus obtained was
not always higher (and sometimes even lower) than that of the
analytical methods. The current study was initiated in an
attempt to develop an automated analytical method to
predict signal peptides based on a scaled window model and
Markov chain theory (see, e.g. Bhat, [2]).



2. Materials and methods

Signal peptides comprise the N-terminal part of the
secretory protein chain. They control the entry of virtually all
proteins to the secretory pathway, both in eukaryotes and
prokaryotes [12,18,21], and are cleaved off by signal
peptidase (Fig. 1) while the protein is translocated through the
membrane. As shown in Fig. 1, the cleavage site is at the
sequential position between -1 and +1. Accordingly,
the prediction of signal peptides is directly correlated
with the prediction of the secretion-cleaved site. The length
of signal peptides varies among different secretory proteins.
As shown in Fig. 2, of the 1939 signal peptides studied by
Nielsen et al. [17], one (the shortest) contains 8 amino acid
residues, one (the longest) contains 90 residues, and the
majority are within the range of 18-25 residues. The extreme
variation in length and sequence has made it a herculean
task to formulate a general algorithm for predicting the
signal peptides. To deal with this kind of situation, let us
consider a window with a scale of $-\xi_1, \ldots, -3, -2, -1,
+1, +2, \ldots, +\xi_2$ (Fig. 3). Such a window is called a "scaled
window" and symbolized as $[-\xi_1, +\xi_2]$. When sliding the
scaled window $[-\xi_1, +\xi_2]$ along a sequence of $n$ residues,
one obtains $n - (\xi_1 + \xi_2) + 1$ different sequence segments. Of the
sequence segments thus generated, only the one with the
residue at the scale -1 being the very last residue of the
signal sequence is deemed the secretion-cleavable
segment (Fig. 3a), while all the other segments are deemed
non-secretion-cleavable (Fig. 3b and c). In this way, by
sliding the scaled window $[-\xi_1, +\xi_2]$ along a protein
sequence of $n$ residues, one can generate one, and only one,
secretion-cleavable segment and $n - (\xi_1 + \xi_2)$
non-secretion-cleavable segments if the protein is secretory, but
$n - (\xi_1 + \xi_2) + 1$ non-secretion-cleavable segments if it is
non-secretory. All the secretion-cleavable segments form a
positive set denoted by $S^+$, and all the non-secretion-cleavable
segments form a negative set $S^-$.
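
The segment bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's code: the function name `scaled_window_segments` and the argument `signal_len` (the known signal peptide length, or `None` for a non-secretory protein) are assumptions introduced here.

```python
from typing import List, Optional, Tuple


def scaled_window_segments(
    seq: str, xi1: int, xi2: int, signal_len: Optional[int] = None
) -> Tuple[List[str], List[str]]:
    """Slide the scaled window [-xi1, +xi2] along seq.

    Returns (positive, negative) lists of segments. A segment is positive
    only when its residue at scale -1 is the last residue of the signal
    peptide, i.e. when the window starts at index signal_len - xi1.
    """
    width = xi1 + xi2
    positive: List[str] = []
    negative: List[str] = []
    for start in range(len(seq) - width + 1):
        segment = seq[start:start + width]
        if signal_len is not None and start + xi1 == signal_len:
            positive.append(segment)
        else:
            negative.append(segment)
    return positive, negative
```

For a secretory protein of n residues this yields one positive and n - (xi1 + xi2) negative segments, and for a non-secretory protein n - (xi1 + xi2) + 1 negative segments. A signal peptide shorter than xi1 produces no positive segment at all, which is the exclusion effect discussed with Table 1.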

Segments generated by sliding the scaled window $[-\xi_1, +\xi_2]$
along a protein sequence can be generally expressed as

$$R_{-\xi_1} R_{-(\xi_1-1)} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+(\xi_2-1)} R_{+\xi_2} \tag{1}$$

where $R_{-\xi_1}$ represents the residue at the scale $-\xi_1$, $R_{-1}$ the
residue at the scale -1, $R_{+1}$ the residue at the scale +1, and
so forth.

If the amino acid residue at each of the segment subsites
(eq. 1) can be treated as an independent element, i.e. there is
no coupling at all among these subsites, then its attribute to
the cleavable set $S^+$ and that to the non-cleavable set $S^-$ can
be formulated, respectively, as [4]

$$\Psi_0^+(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = P^+_{-\xi_1}(R_{-\xi_1}) \cdots P^+_{-3}(R_{-3}) P^+_{-2}(R_{-2}) P^+_{-1}(R_{-1}) P^+_{+1}(R_{+1}) P^+_{+2}(R_{+2}) \cdots P^+_{+\xi_2}(R_{+\xi_2}) \tag{2a}$$

$$\Psi_0^-(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = P^-_{-\xi_1}(R_{-\xi_1}) \cdots P^-_{-3}(R_{-3}) P^-_{-2}(R_{-2}) P^-_{-1}(R_{-1}) P^-_{+1}(R_{+1}) P^-_{+2}(R_{+2}) \cdots P^-_{+\xi_2}(R_{+\xi_2}) \tag{2b}$$

where $P^+_i(R_i)$ is the probability of amino acid $R_i$ occurring at
the subsite $i$ for the secretion-cleavable segments, and $P^-_i(R_i)$
the corresponding probability for the non-secretion-cleavable
segments. The values of the former can be derived from a
positive training dataset $S_0^+$ consisting of only
secretion-cleavable segments, and the values of the latter can be
derived from a negative training dataset $S_0^-$ consisting of
only non-cleavable segments. The subscript 0 of $\Psi$ indicates
that the attribute function is formed by independent
probabilities in which no coupling effect whatsoever among
subsites is included, as shown by the right-hand side of eq. 2.
However, in reality the protein subsites are often coupled 



with one another. If the coupling effect of a residue with
those adjacent to it must be taken into account, then the
probability factors in eqs. 2a and 2b should be modified as

$$\Psi^+(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = P^+_{-\xi_1}(R_{-\xi_1}) P^+_{-(\xi_1-1)}(R_{-(\xi_1-1)} | R_{-\xi_1}) \cdots P^+_{-2}(R_{-2} | R_{-3}) P^+_{-1}(R_{-1} | R_{-2}) P^+_{+1}(R_{+1} | R_{-1}) P^+_{+2}(R_{+2} | R_{+1}) \cdots P^+_{+\xi_2}(R_{+\xi_2} | R_{+(\xi_2-1)}) \tag{3a}$$

and

$$\Psi^-(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = P^-_{-\xi_1}(R_{-\xi_1}) P^-_{-(\xi_1-1)}(R_{-(\xi_1-1)} | R_{-\xi_1}) \cdots P^-_{-2}(R_{-2} | R_{-3}) P^-_{-1}(R_{-1} | R_{-2}) P^-_{+1}(R_{+1} | R_{-1}) P^-_{+2}(R_{+2} | R_{+1}) \cdots P^-_{+\xi_2}(R_{+\xi_2} | R_{+(\xi_2-1)}) \tag{3b}$$

respectively, where $P^+_{-\xi_1}(R_{-\xi_1})$ and $P^-_{-\xi_1}(R_{-\xi_1})$ are the
same as in eq. 2a and eq. 2b; $P^+_{-(\xi_1-1)}(R_{-(\xi_1-1)} | R_{-\xi_1})$ is the
probability of amino acid $R_{-(\xi_1-1)}$ occurring at the subsite
$-(\xi_1-1)$, given that $R_{-\xi_1}$ has occurred at the subsite $-\xi_1$;
$P^+_{-2}(R_{-2} | R_{-3})$ is the probability of amino acid $R_{-2}$ occurring at
the subsite -2, given that $R_{-3}$ has occurred at the subsite -3;
and so forth. Their values can be derived from a positive
training dataset $S_0^+$ consisting of only secretion-cleavable
peptides. And $P^-_{-(\xi_1-1)}(R_{-(\xi_1-1)} | R_{-\xi_1})$, $P^-_{-2}(R_{-2} | R_{-3})$, ...,
have the same meaning as $P^+_{-(\xi_1-1)}(R_{-(\xi_1-1)} | R_{-\xi_1})$,
$P^+_{-2}(R_{-2} | R_{-3})$, ..., except that they are derived from a
negative training dataset $S_0^-$ consisting of only non-cleavable
peptides.

Fig. 3. Illustration of the scaled window [...]. In (a), (b) and (c), the
scales on the window are aligned with different amino acids so as to define different peptide segments. When, and only when, the scale -1 is aligned with
[...] secretion-cleavable [...]; [...] (b) and
(c), are regarded as non-secretion-cleavable. [...]
by black characters with white background.

Generally speaking, if the coupling effects of the $\mu$ ($\mu$ =
2, 3, ...) closest neighboring amino acid residues need to
be considered, then eq. 2 should be modified according to
the $\mu$th-order Markov chain theory, i.e. the attribute
function $\Psi_0$ should be replaced by $\Psi_\mu$ and the corresponding
probability factors by the $\mu$th-order conditional
probabilities. As one could surmise, the analysis of a higher-order
Markov chain would be much more complicated. Therefore,
the treatment in this paper is confined to the first-order
Markov chain; i.e. only the first-order sequence-coupling
effect is taken into account, as formulated by eq. 3.

Thus, for a given sequence segment as defined in eq. 1, if
its attribute function to the positive training set $S_0^+$ is greater
than that to the negative training set $S_0^-$, i.e. $\Psi^+ > \Psi^-$, then
the sequence is predicted to be secretion-cleavable;
otherwise, it is predicted to be non-secretion-cleavable. We
define a discriminant function $\Delta$, given by

$$\Delta(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = w^+ \Psi^+(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) - w^- \Psi^-(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) \tag{4}$$

where $w^+$ and $w^-$ are the weight factors for the attribute
functions derived from the positive training dataset $S_0^+$ and
negative training dataset $S_0^-$, respectively. If there is no
special reason, they are generally set to be one; i.e. $w^+ =
w^- = 1$. Thus, the criterion of the secretion-cleavable
segment prediction for a given sequence can be formulated as
follows:

$$\begin{cases} \text{The segment is secretion-cleavable,} & \text{if its } \Delta > 0 \\ \text{The segment is non-secretion-cleavable,} & \text{otherwise} \end{cases} \tag{5}$$

During the training process, the parameters $\xi_1$ and $\xi_2$ can be
changed so as to find the optimal prediction quality. Once a
secretion-cleavable segment is predicted, the corresponding
cleavage site and signal peptide are automatically obtained
as described above (cf. Fig. 3).
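
To make eqs. 3-5 concrete, the following is a minimal Python sketch of a first-order Markov attribute function and the resulting decision rule. It is an illustration under stated assumptions rather than the implementation used in the paper: the class name `MarkovSegmentModel`, the additive pseudocount smoothing (the source does not say how zero counts are handled), and the use of log-probabilities are all choices made here. With $w^+ = w^- = 1$, comparing the logarithms of $\Psi^+$ and $\Psi^-$ is equivalent to testing $\Delta > 0$.

```python
import math
from collections import defaultdict
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


class MarkovSegmentModel:
    """First-order Markov attribute function estimated from training segments."""

    def __init__(self, segments: List[str], pseudocount: float = 1.0):
        self.width = len(segments[0])
        self.alpha = pseudocount  # assumed smoothing, not from the source
        self.n = len(segments)
        self.first = defaultdict(float)  # counts at the first subsite
        self.pair = [defaultdict(float) for _ in range(self.width)]
        self.prev = [defaultdict(float) for _ in range(self.width)]
        for seg in segments:
            self.first[seg[0]] += 1
            for i in range(1, self.width):
                self.pair[i][(seg[i - 1], seg[i])] += 1
                self.prev[i][seg[i - 1]] += 1

    def log_psi(self, seg: str) -> float:
        """Logarithm of the product of probability factors in eq. 3."""
        k = len(AMINO_ACIDS)
        total = math.log(
            (self.first[seg[0]] + self.alpha) / (self.n + self.alpha * k)
        )
        for i in range(1, self.width):
            num = self.pair[i][(seg[i - 1], seg[i])] + self.alpha
            den = self.prev[i][seg[i - 1]] + self.alpha * k
            total += math.log(num / den)
        return total


def is_cleavable(seg: str, pos: MarkovSegmentModel, neg: MarkovSegmentModel) -> bool:
    """Eq. 5 with w+ = w- = 1: Delta > 0 iff log Psi+ > log Psi-."""
    return pos.log_psi(seg) > neg.log_psi(seg)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied; it does not change which of the two attribute functions is larger.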

3. Results 

To compare the power of different prediction algorithms,
one must use the same database; otherwise, the comparison
would be meaningless. In view of this, we should adopt a
dataset that is accessible to the public. The dataset
investigated by Nielsen et al. [17] satisfies such a prerequisite; it
can be retrieved from an FTP server at ftp://virus.cbs.dtu.dk/
pub/signalp. The dataset consists of 1939 secretory proteins
and 1440 non-secretory proteins. The former contains 416
human, 1011 eukaryote, 105 E. coli, 266 Gram-, and 141
Gram+ proteins, while the latter contains 251 human, 820
eukaryote, 119 E. coli, 186 Gram-, and 64 Gram+ proteins.
Redundant sequences were removed to guarantee that no
pairs of homologous sequences exist in the dataset. For the
secretory proteins, the sequence of the signal peptide and
the first 30 amino acids of the mature protein were included
in the dataset, while for the non-secretory proteins, the first
70 amino acids of each sequence were included.
Furthermore, to show the power of the current algorithm, the
comparison should be made with the best result derived
from the same dataset by the previous investigators.
According to the report by Nielsen et al. [17], the
average overall rate of correct prediction for the cleavage
site was about 72%. As pointed out by Nielsen et al. [17], if
"the original weight matrix algorithm [20] is applied to"
their dataset, "the performance is much lower." Now let us
apply the current prediction algorithm to the same dataset as
used by Nielsen et al. [17], and see what results will be
yielded.

The rates of correct prediction for the signal peptide set
and non-signal peptide set are given by

$$\begin{cases} \Lambda^+ = \dfrac{N^+ - m^+}{N^+}, & \text{for signal peptides} \\ \Lambda^- = \dfrac{N^- - m^-}{N^-}, & \text{for non-signal peptides} \end{cases} \tag{6}$$

where $N^+$ represents the total number of signal peptides,
and $m^+$ is the number of signal peptides missed in
prediction; $N^-$ is the total number of non-signal peptides, and $m^-$
is the number of non-signal peptides incorrectly predicted as
signal peptides. The overall rate of correct prediction
concerned is given by

$$\Lambda = \frac{\Lambda^+ N^+ + \Lambda^- N^-}{N^+ + N^-} = 1 - \frac{m^+ + m^-}{N^+ + N^-} \tag{7}$$
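
As a quick check of eqs. 6 and 7, the short Python sketch below computes the three rates from raw counts and confirms that the two forms of eq. 7 agree. The function name `success_rates` is introduced here for illustration, and the counts in the usage note are hypothetical, not results from the paper.

```python
from typing import Tuple


def success_rates(n_pos: int, m_pos: int, n_neg: int, m_neg: int) -> Tuple[float, float, float]:
    """Return (Lambda+, Lambda-, Lambda) as defined in eqs. 6 and 7.

    n_pos, n_neg: total numbers of signal and non-signal peptides.
    m_pos: signal peptides missed in prediction.
    m_neg: non-signal peptides incorrectly predicted as signal peptides.
    """
    lam_pos = (n_pos - m_pos) / n_pos
    lam_neg = (n_neg - m_neg) / n_neg
    lam = 1.0 - (m_pos + m_neg) / (n_pos + n_neg)
    return lam_pos, lam_neg, lam
```

For example, with hypothetical counts of 100 signal peptides (20 missed) and 900 non-signal peptides (30 misclassified), the overall rate is 0.95 even though the rate for signal peptides is only 0.80, which illustrates why the overall rate alone can be misleading when the negative set is much larger.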

We might have the situation of overprediction if the rate
$\Lambda^+$ is very high but $\Lambda^-$ very low; i.e. many non-signal
peptides are incorrectly predicted as signal peptides. On the
other hand, we might have the situation of underprediction
if the rate $\Lambda^+$ is very low but $\Lambda^-$ very high; i.e. many signal
peptides are incorrectly predicted as non-signal peptides.
Accordingly, the real prediction accuracy should be
measured by $\Lambda$, the overall success rate. However, since the
number of non-signal peptides is much greater than that of
the signal peptides, we might also have the unpleasant
situation where the overall rate $\Lambda$ is very high but $\Lambda^+$ very
low. The [...] way to solve this kind of situation is
to keep $\Lambda^+$ at a decent rate while seeking the highest
rate for $\Lambda$. What rate for $\Lambda^+$ is decent? For the current case,
it should be at least higher than 72%.

The prediction quality was examined by the standard
testing procedure in statistics [15], that is, a combination of
the self-consistency and jackknife tests. In the former, the
signal peptide of each protein in a given dataset was
predicted using the parameters derived from the same dataset,
the so-called training dataset, while in the latter, each
protein in the training dataset was singled out in turn as a "test
protein" and all the rule-parameters were derived from the
remaining proteins. Compared with the independent dataset
test and sub-sampling test often adopted in biology, the
jackknife test is thought the most effective method for
cross-validation in statistics [15]. This is because in the
independent dataset test, the selection of a testing dataset is
arbitrary and the accuracy thus obtained lacks an objective
criterion unless the testing dataset is sufficiently large [7].
As for the sub-sampling test, in which a dataset is
divided into several subsets, the problem is that the number
of possible divisions might be too large to be handled. For
example, in the treatment by Nielsen et al. [17] each dataset
was divided into five approximately equal size parts and
then every network run was carried out with one part as test
data and the other four parts as training data. The
performance measures were then calculated as an average over the
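five different dataset divisions. Thus, according to their
cross-validation scheme, the number of possible
sub-sampling combinations would be $\Phi = \Phi_{Sec} \times \Phi_{Non}$,
where $\Phi_{Sec} = \Phi_{Sec}^{hum} \times \Phi_{Sec}^{euk} \times \Phi_{Sec}^{ecoli} \times \Phi_{Sec}^{gram-} \times \Phi_{Sec}^{gram+}$
is the number of possible sub-sampling combinations in the
dataset of secretory proteins and $\Phi_{Non} = \Phi_{Non}^{hum} \times \Phi_{Non}^{euk} \times
\Phi_{Non}^{ecoli} \times \Phi_{Non}^{gram-} \times \Phi_{Non}^{gram+}$ the number of possible
sub-sampling combinations in the dataset of non-secretory
proteins. For the data studied by Nielsen et al. [17], we have
$\Phi_{Sec}^{hum}$ = 416!/(83!83!83!83!84!), $\Phi_{Sec}^{euk}$ = 1011!/
(202!202!202!202!203!), $\Phi_{Sec}^{ecoli}$ = 105!/(21!21!21!21!21!),
$\Phi_{Sec}^{gram-}$ = 266!/(53!53!53!53!54!), $\Phi_{Sec}^{gram+}$ =
141!/(28!28!28!28!29!), $\Phi_{Non}^{hum}$ = 251!/(50!50!50!50!51!),
$\Phi_{Non}^{euk}$ = 820!/(164!164!164!164!164!), $\Phi_{Non}^{ecoli}$ = 119!/
(24!24!24!24!23!), $\Phi_{Non}^{gram-}$ = 186!/(37!37!37!37!38!), and
$\Phi_{Non}^{gram+}$ = 64!/(13!13!13!13!12!). Of $\Phi_{Sec}^{hum}$, $\Phi_{Sec}^{euk}$,
$\Phi_{Sec}^{ecoli}$, $\Phi_{Sec}^{gram-}$ and $\Phi_{Sec}^{gram+}$, the smallest is
$\Phi_{Sec}^{ecoli}$ ~3.1 × 10^69,
implying $\Phi_{Sec}$ would be > 15.5 × 10^345. Of $\Phi_{Non}^{hum}$,
$\Phi_{Non}^{euk}$, $\Phi_{Non}^{ecoli}$, $\Phi_{Non}^{gram-}$ and $\Phi_{Non}^{gram+}$, the smallest is $\Phi_{Non}^{gram+}$
~1.76 × 10^41, implying $\Phi_{Non}$ would be > 8.8 × 10^205.
Thus, $\Phi = \Phi_{Sec} \times \Phi_{Non}$ would be > 1.36 × 10^552. For such
a huge number of combinations, it is impossible for any
existing computer to handle. In fact, in any practical
sub-sampling tests as carried out by Nielsen et al. [17], only a
very small fraction of the possible combinations were
investigated, and the results thus obtained would unavoidably
bear a considerable arbitrariness. Accordingly, the testing
procedure adopted here is much more objective and rigorous.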
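
The magnitudes quoted above can be checked numerically. The sketch below is an illustration added here, not part of the paper; the helper name `log10_divisions` is hypothetical. It evaluates the base-10 logarithm of a multinomial coefficient n!/(k_1! k_2! ...) with the standard-library function `math.lgamma`, so the factorials never have to be formed explicitly.

```python
import math
from typing import Sequence


def log10_divisions(part_sizes: Sequence[int]) -> float:
    """log10 of n! / (k1! k2! ... ), where n is the sum of the part sizes."""
    n = sum(part_sizes)
    # lgamma(x + 1) equals ln(x!)
    ln_value = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in part_sizes)
    return ln_value / math.log(10)


# The two smallest factors quoted in the text
ecoli_secretory = log10_divisions([21, 21, 21, 21, 21])       # 105 proteins
gram_pos_nonsecretory = log10_divisions([13, 13, 13, 13, 12])  # 64 proteins
```

The two values come out near 69.5 and 41.2, in line with the quoted ~3.1 × 10^69 and ~1.76 × 10^41.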
Prediction was performed by selecting different
parameters for the scaled window $[-\xi_1, +\xi_2]$. Preliminary
jackknife tests indicated that for a given $\xi_1$ the optimal result for
$\Lambda^+$ was obtained when $\xi_2$ = 1. Generally speaking, for the
self-consistency test, the rate of correct prediction will be
increased by widening the scaled window, i.e. increasing the
values of $\xi_1$ (Table 1). However, for the jackknife test, an
increase of $\xi_1$ after its reaching a certain value will
gradually reduce $\Lambda^+$, the rate of correct prediction for signal
peptides, although the overall rate of correct prediction
keeps increasing (Table 1). Particularly, if $\xi_1$ is too large,
many short signal peptides will be excluded from prediction.

Table 1

Scaled window          Success rate Λ+ for the         Number of signal       Overall success rate Λ
[-ξ1, +ξ2]             signal peptide set (a)          peptides excluded (b)
                       Self-consistency   Jackknife                           Self-consistency   Jackknife
[-6, +1]               94.43%             79.78%       0                      89.26%             89.20%
[-7, +1]               95.20%             79.42%       0                      90.66%
[-8, +1]               96.03%             78.34%       0                      92.02%             91.88%
[-9, +1]               96.13%             78.49%                              93.11%             92.90%
[-10, +1]              96.96%             78.34%       2                      [...]              93.84%
[-11, +1]              97.58%             77.20%       2                      94.85%             94.60%
[-12, +1]                                 76.48%       5
[-13, +1]              98.04%             75.81%       6                      95.99%             95.75%
[-14, +1]              98.25%             74.98%       8                      96.25%             95.96%
[-15, +1]              98.25%             73.49%       13                     96.53%             96.21%
[-16, +1]              96.39%             69.98%       52                     96.79%             96.44%

(a) This is the success rate for the 1939 signal peptides [...].
(b) The excluded signal peptides should be counted as those missed in prediction, and hence are a part of $m^+$ as defined in eq. 6.





As shown in Table 1, when $\xi_1 \le 8$, no signal peptides are
excluded; when $\xi_1$ = 9, one is excluded; when
$\xi_1$ = 10 or 11, two excluded; when $\xi_1$ = 12, five excluded;
when $\xi_1$ = 13, six excluded; when $\xi_1$ = 14, eight; when
$\xi_1$ = 15, thirteen; and so forth. These excluded signal
peptides should be counted as unsuccessful
events. To keep the number of excluded signal peptides
low and meanwhile to keep $\Lambda^+$ greater than 72%, we
select $\xi_1$ = 12-14 and $\xi_2$ = 1. With these optimal
parameters for the scaled window, the number of signal peptides
excluded is less than 10 while the overall rates of correct
prediction by both self-consistency and jackknife tests are
over 95%.



4. Discussion 

Since the current model is explicitly correlated with the
sequential coupling along a peptide chain, it will provide a
[...]
molecular mechanism of the
ZIP code protein-sorting system in cells, such as what will
happen if an amino acid in the signal sequence is replaced
by another, and how the signal sequence interacts with its
counterpart of the signal peptidase. Moreover, the present
method may also be used to improve the protein subcellular
location prediction [...] essential
correlation of protein location directly with signal sequence,
rather than the one indirectly with amino acid composition,
as formulated in a number of papers in this area [3,6,8,9,19]
and summarized in a recent review article [5]. It should be
pointed out that the formulation presented here is a general
one. By some modification, it can be used to study the
coupling effects among some specific subsites as well [10].

COMMUNICATION

Using subsite coupling to predict signal peptides



Kuo-Chen Chou 

Coiiiraita^AaJixi Dr::j 1 t I i and ! 11 

Given a nascent protein sequence, how can one predict its
signal peptide or 'Zipcode' sequence? This is a first
important problem for scientists to use signal peptides as a vehicle
to find new drugs or to reprogram cells for gene therapy.
Based on a model that takes into account the coupling
effect among some key subsites, the so-called {-3, -1, +1}
coupling model, a new prediction algorithm is developed.
The overall rate of correct prediction for 1939 secretory
proteins and 1440 non-secretory proteins was over 92%.
It has not escaped our attention that the new method may
also serve as a useful tool for helping investigate further
many unclear details regarding the molecular mechanism
of the ZIP code protein-sorting system in cells.
Keywords: {-3, -1, +1} coupling/non-secretory proteins/
secretory proteins/'Zipcode' sequence



Introduction 

The knowledge of protein signals can be used to reprogram
cells in a specific way for future cell and gene therapy. Protein
signals have become a crucial tool for researchers [...]
new drugs that are targeted to a particular organelle to correct
a specific defect. For example, by adding a specific tag to the
desired proteins, one can tag them for excretion, making them
much easier to harvest ([...], 1999). To use such a tool
[...]. Since
the number of nascent protein sequences [...] has
been rapidly increasing, it is time consuming and costly to
identify the signal peptides entirely by experiments. Thus, a
strong [...] in the automated identification of [...]
sequences and prediction of their cleavage sites has been
[...]. The importance of predicting [...]
has also been elaborated recently in an excellent review by
Nakai (2000).

The existing methods in this area are based mostly on the
use of neural networks ([...]; [...] et al.,
1999; Nakai, 2000). They are actually the application of
[...] machine learning technique. As pointed out by King ([...]),
the advantages of [...] methods are that
they are 'readily available' and 'often successful in practice'.
He also pointed out that the disadvantages are that there is
'little use of chemical or physical theory', the methods have
'very poor explanatory power - a Hinton diagram means
nothing [...]' and [...] 'statistically rather
poorly characterized'. Besides, although the computational
costs for training the networks were considerably higher,
the prediction accuracy thus obtained was not higher (and
sometimes even lower) than that of the analytical method. The current
study was initiated in an attempt to develop an automated
method based on the sub-site coupling principle that can be
used to identify signal peptides faster and more accurately.

© Oxford University Press 



Materials and methods 

Signal peptides comprise the N-terminal part of the secretory
protein chain. They control the entry of virtually all proteins
to the secretory pathway, in both eukaryotes and prokaryotes
(Gierasch, 1989; Rapoport, 1992) and are cleaved off by
signal peptidase while the protein is translocated through the
membrane. As shown in Figure 1, the cleavage site is at
(-1, +1), i.e. the location between residues -1 and +1 or
between the last residue of the signal peptide and the first
residue of the mature protein. [...] prediction of
the signal peptide of a nascent protein is immediately correlated
with the prediction of its cleavage site.

The length of signal peptides varies for different secretory
proteins. As shown in Figure 2, of the 1939 signal peptides
studied by Nielsen et al. (1997), the shortest one contains
eight amino acid residues and the longest contains 90 residues,
while the majority have a length within 18-25 residues. The
extreme variation in length and sequence has posed a difficulty
for formulating a general algorithm to predict the signal
peptides. To deal with this kind of situation, let us consider a
window with a scale of $-\xi_1, \ldots, -3, -2, -1, +1, +2, \ldots, +\xi_2$
(Figure 3). Such a window is called a 'scaled window' and
[...]
highlight $n - (\xi_1 + \xi_2) + 1$ different sequences. Note that for
the current study the identification of cleavage site is very
important because it is directly correlated with a correct
prediction of the signal peptide. For example, instead of the
site (-1, +1), if the cleavage site is identified at (-2, -1) or
(+1, +2), then the signal peptide thus derived
will be one residue shorter or longer than the actual one
(Figure 1). Therefore, of the sequence segments highlighted
by the scaled window, only the one with the residue at the
scale -1 being the very last residue of the signal sequence is
regarded as the secretion-cleavable segment (Figure 3a), while
all the other segments are regarded as non-secretion-cleavable
(Figure 3b and c). In this way, by sliding the scaled
window $[-\xi_1, +\xi_2]$ along a protein sequence of $n$ residues,
one can generate one and only one secretion-cleavable
segment and $n - (\xi_1 + \xi_2)$ non-secretion-cleavable segments if
the protein is secretory, [...] if it is non-secretory. All the
secretion-cleavable segments form a cleavable or positive set denoted
by $S^+$ and all the non-secretion-cleavable segments form a
non-cleavable or negative set $S^-$.

Segments generated by sliding the scaled window $[-\xi_1, +\xi_2]$
along a protein sequence can be generally expressed as

$$R_{-\xi_1} R_{-(\xi_1-1)} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+(\xi_2-1)} R_{+\xi_2} \tag{1}$$

where $R_{-\xi_1}$ represents the residue at the scale $-\xi_1$, $R_{-1}$ the
residue at the scale -1, $R_{+1}$ the residue at the scale +1 and
so forth.

If the amino acid residue at each of the segment subsites
(Equation 1) can be treated as an independent element, i.e.

Exhibit C(4) 



there is no coupling at all among the subsites, then its
attribute to the cleavable set $S^+$ and that to the non-cleavable
set $S^-$ can be formulated, respectively, as

$$\Psi_0^+(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = P^+_{-\xi_1}(R_{-\xi_1}) \cdots P^+_{-3}(R_{-3}) P^+_{-2}(R_{-2}) P^+_{-1}(R_{-1}) P^+_{+1}(R_{+1}) P^+_{+2}(R_{+2}) \cdots P^+_{+\xi_2}(R_{+\xi_2}) \tag{2a}$$

$$\Psi_0^-(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = P^-_{-\xi_1}(R_{-\xi_1}) \cdots P^-_{-3}(R_{-3}) P^-_{-2}(R_{-2}) P^-_{-1}(R_{-1}) P^-_{+1}(R_{+1}) P^-_{+2}(R_{+2}) \cdots P^-_{+\xi_2}(R_{+\xi_2}) \tag{2b}$$

where $P^+_i(R_i)$ is the probability of amino acid $R_i$ occurring at
the subsite $i$ (= $-\xi_1, \ldots, -3, -2, -1, +1, +2, \ldots, +\xi_2$) for the
secretion-cleavable segments and $P^-_i(R_i)$ the corresponding
probability for the non-secretion-cleavable segments. The
values of the former can be derived from a positive training
data set $S_0^+$ consisting of only secretion-cleavable segments
and the values of the latter can be derived from a negative
training data set $S_0^-$ consisting of only non-cleavable segments.
The subscript 0 of $\Psi$ indicates that the attribute function is
formed by independent probabilities in which no coupling
effect between subsites is included, as shown by the
right-hand side of Equation 2. However, in reality the protein
subsites are often coupled with one another. Therefore, it is
instructive to conduct a statistical analysis for the 1939
secretory protein sequences retrieved from Nielsen et al.
(1997). The results thus obtained are illustrated in Figure 4, from
which we can see that the amino acid residues at the subsites
-3, -1 and +1 are mostly occupied by Ala. Furthermore,
according to the detailed numbers generated through the
statistical analysis, of the 1939 protein sequences, the
occurrence frequencies of Ala at the subsites -3, -1 and +1
are 667, 1084 and 397, respectively, while the occurrence
frequencies of the other 19 amino acids at these subsites are
relatively much lower. Besides, all these three subsites are
very close to the cleavage site (Figure 1). This suggests that
a highly special match between the signal peptidase and the
secretory protein at the subsites -3, -1 and +1 is required
[...]. [...]
powerful method for predicting the signal peptides, the coupling
[...] these three key subsites, i.e. the {-3, -1, +1} coupling,
must be taken into account. Thus, Equations 2a and 2b should
be modified to

$$\Psi^+(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = P^+_{-\xi_1}(R_{-\xi_1}) \cdots P^+_{-3}(R_{-3}) P^+_{-2}(R_{-2}) P^+_{-1}(R_{-1} | R_{-3}) P^+_{+1}(R_{+1} | R_{-1}) P^+_{+2}(R_{+2}) \cdots P^+_{+\xi_2}(R_{+\xi_2}) \tag{3a}$$

$$\Psi^-(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = P^-_{-\xi_1}(R_{-\xi_1}) \cdots P^-_{-3}(R_{-3}) P^-_{-2}(R_{-2}) P^-_{-1}(R_{-1} | R_{-3}) P^-_{+1}(R_{+1} | R_{-1}) P^-_{+2}(R_{+2}) \cdots P^-_{+\xi_2}(R_{+\xi_2}) \tag{3b}$$

respectively, where $P^+_i(R_i)$ and $P^-_i(R_i)$ are the same as those
in Equation 2. $P^+_{-1}(R_{-1} | R_{-3})$ is the probability of amino acid
$R_{-1}$ occurring at the subsite -1, given that $R_{-3}$ has occurred at
the subsite -3; $P^+_{+1}(R_{+1} | R_{-1})$ is the probability of amino acid
$R_{+1}$ occurring at the subsite +1, given that $R_{-1}$ has occurred
at the subsite -1. Their values can be derived from a positive
training data set $S_0^+$ consisting of only secretion-cleavable
peptides. Also, $P^-_{-1}(R_{-1} | R_{-3})$ and $P^-_{+1}(R_{+1} | R_{-1})$ have the same
meaning as $P^+_{-1}(R_{-1} | R_{-3})$ and $P^+_{+1}(R_{+1} | R_{-1})$ except that they
are derived from a negative training data set $S_0^-$ consisting of
only non-cleavable peptides.

Thus, for a given peptide sequence as defined in Equation
1, if its attribute function to the positive training set $S_0^+$ is
greater than that to the negative training set $S_0^-$, i.e. $\Psi^+ > \Psi^-$,
then the sequence is predicted to be secretion-cleavable;
otherwise, it is predicted to be non-secretion-cleavable. We
define a discriminant function $\Delta$, given by

$$\Delta(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) = w^+ \Psi^+(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) - w^- \Psi^-(R_{-\xi_1} \cdots R_{-3} R_{-2} R_{-1} R_{+1} R_{+2} \cdots R_{+\xi_2}) \tag{4}$$

where $w^+$ and $w^-$ are the weight factors for the attribute
functions derived from the positive training data set $S_0^+$ and
negative training data set $S_0^-$, respectively. If there is no special
reason, they are generally set to be one, i.e. $w^+ = w^- = 1$.
Thus, the criterion of predicting the secretion-cleavability for a
given peptide sequence can be formulated as follows:

$$\begin{cases} \text{The peptide is secretion-cleavable,} & \text{if its } \Delta > 0 \\ \text{The peptide is non-secretion-cleavable,} & \text{otherwise} \end{cases} \tag{5}$$

During the training process, the parameters $\xi_1$ and $\xi_2$ can
be changed so as to find the optimal prediction quality. Once
a secretion-cleavable peptide is predicted, the corresponding
cleavage site and signal peptide are automatically obtained as
described above (cf. Figures 1 and 3a).
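
The {-3, -1, +1} coupling of Equations 3-5 can be sketched in Python as follows. This is a hedged illustration, not the published implementation: the class name `SubsiteCouplingModel`, the additive pseudocount (the paper does not state how unseen residues are treated) and the log-space comparison are assumptions made here. All subsites use independent probabilities except -1, which is conditioned on -3, and +1, which is conditioned on -1.

```python
import math
from collections import Counter
from typing import List

K = 20  # number of standard amino acids


class SubsiteCouplingModel:
    """Attribute function with {-3, -1, +1} subsite coupling (Equation 3)."""

    def __init__(self, segments: List[str], xi1: int, pseudocount: float = 1.0):
        self.width = len(segments[0])
        if xi1 < 3 or self.width <= xi1:
            raise ValueError("window must cover subsites -3, -1 and +1")
        self.alpha = pseudocount  # assumed smoothing, not from the source
        self.n = len(segments)
        # Per-position residue counts (independent factors)
        self.single = [Counter(s[i] for s in segments) for i in range(self.width)]
        # Index of each coupled subsite mapped to the subsite it depends on:
        # -1 depends on -3, and +1 depends on -1
        self.coupled = {xi1 - 1: xi1 - 3, xi1: xi1 - 1}
        self.pairs = {
            i: Counter((s[j], s[i]) for s in segments)
            for i, j in self.coupled.items()
        }

    def log_psi(self, seg: str) -> float:
        total = 0.0
        for i in range(self.width):
            if i in self.coupled:
                j = self.coupled[i]
                num = self.pairs[i][(seg[j], seg[i])] + self.alpha
                den = self.single[j][seg[j]] + self.alpha * K
            else:
                num = self.single[i][seg[i]] + self.alpha
                den = self.n + self.alpha * K
            total += math.log(num / den)
        return total


def predict_cleavable(seg: str, pos: SubsiteCouplingModel, neg: SubsiteCouplingModel) -> bool:
    """Equation 5 with w+ = w- = 1, evaluated in log space."""
    return pos.log_psi(seg) > neg.log_psi(seg)
```

Compared with a full first-order chain, this model needs only two conditional tables per training set, which keeps the number of parameters small relative to the 1939 positive segments available.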

Results and discussion 

To show the power of the [...] algorithm, the
following two criteria should be followed: (1) using a good
data set that is accessible to the public and (2) comparison
with the best result reported in the literature. The data set
investigated by Nielsen et al. (1997) satisfies the first criterion;
it can be retrieved from an FTP server at ftp://virus.cbs.dtu.dk/
pub/signalp. It consists of 1939 secretory proteins and 1440
non-secretory proteins. [...]
the non-secretory proteins, the first 70 amino acids of each
sequence were included. According to [...] report, the average
rate of correct prediction for the cleavage site location by the
neural network method [...] the best
rate so far reported for such a large data set available to the
public. Therefore, the result reported by Nielsen et al. (1997)
also satisfies the second criterion. To compare the prediction
quality at an equivalent condition, we used the same data set
as used by Nielsen et al. (1997).

The rate of correct prediction for the signal peptide set and
non-signal peptide set is given by

$$\begin{cases} \Lambda^+ = \dfrac{N^+ - m^+}{N^+}, & \text{for signal peptides} \\ \Lambda^- = \dfrac{N^- - m^-}{N^-}, & \text{for non-signal peptides} \end{cases} \tag{6}$$

where $N^+$ represents the total number of signal peptides and
$m^+$ is the number of signal peptides missed in prediction; $N^-$
is the total number of non-signal peptides and $m^-$ is the number
of non-signal peptides incorrectly predicted as signal peptides.
The overall rate of correct prediction concerned is given by

$$\Lambda = \frac{\Lambda^+ N^+ + \Lambda^- N^-}{N^+ + N^-} = 1 - \frac{m^+ + m^-}{N^+ + N^-} \tag{7}$$

Table I. Performance of the [...] coupling model



The prediction quality was examined by the standard testing
procedure in statistics (Mardia et al., 1979), that is, a combination
of the self-consistency and jackknife tests. In the former,
the signal peptide of each protein in a given data set was
predicted using the parameters derived from the same data set,
the so-called training data set, whereas in the latter, each
protein in the training data set was singled out in turn as a
'test protein' and all the rule-parameters were derived from
the remaining proteins. Compared with the independent data
set test and sub-sampling test often adopted in biology, the
jackknife test is considered to be the most [...]
for cross-validation in statistics (Mardia et al., 1979). This is
because in the independent data set test, the selection of a
testing data set is arbitrary and the accuracy thus obtained
lacks an objective criterion unless the testing data set is
sufficiently large (Chou and Zhang, 1995). As for the
sub-sampling test in which a given data set is divided into
several subsets, the problem is that the number of possible
divisions might be too large to be handled. For example, in
the treatment by Nielsen et al. (1997), each data set was
divided into five approximately equal size parts and then every
network run was carried out with one part as test data and the
other four parts as training data. The performance measures
were then calculated as an average over the five different
[...]. [...] data of [...]
proteins, the number of possible combinations would be
$\Phi = \Phi_1 \times \Phi_2 \times \Phi_3 \times \Phi_4 \times \Phi_5$, where $\Phi_1$ = 416!/(83!83!83!
83!84!), $\Phi_2$ = 1011!/(202! [...]), $\Phi_3$ = 105!/
(21!21!21!21!21!), $\Phi_4$ = 266!/(53!53!53!53!54!) and $\Phi_5$ =
141!/(28!28!28!28!29!). Of $\Phi_1$, $\Phi_2$, $\Phi_3$, $\Phi_4$ and $\Phi_5$, the smallest
is [...]. [...] it is
impossible for any existing computer to handle such a huge- 
number of combinations. In fact in any practical sub-sampling 
tests as perfor I > i""! i.o aun sma 1 

fraction of the possible combinations were investigated and 
the results thus obtained could not avoid a considerable 
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87.44 




tk +2i 


90.36 


80. iO 


89.12 
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[-16, +2] 
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92.74 


[-18, +2] 


S6.02 


93.09 


92,99 


Jackknife test

Scaled window    A+       A-       A
[-6, +2]         [...]    [...]    [...]
[-10, +2]        [...]    [...]    90.71
[-12, +2]        [...]    92.10    92.06
[-13, +2]        89.63    92.46    92.42
[-14, +2]        89.58    92.57    92.53
[-15, +2]        89.94    92.66    92.63
[-16, +2]        [...]    92.74    92.68
[-18, +2]        81.74    93.08    92.93

See Equations 6 and 7 for the definitions of A+, A- and A.

arbitrariness. Accordingly, the testing procedure adopted here 
is much more objective and rigorous. 
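
To give a sense of the scale involved, the following minimal sketch counts the ways of dividing a data set into five nearly equal parts, using the part sizes that appear in the factorial expressions quoted above. The function name `partition_count` is hypothetical, and the parts are treated as ordered, as in those expressions.

```python
from math import factorial


def partition_count(part_sizes):
    """Multinomial coefficient n! / (n1! n2! ... nk!) for the given part sizes."""
    total = factorial(sum(part_sizes))
    for size in part_sizes:
        total //= factorial(size)  # each partial quotient is an exact integer
    return total


# Part sizes taken from the factorial expressions in the text.
five_fold_splits = {
    416: [83, 83, 83, 83, 84],
    266: [53, 53, 53, 53, 54],
    141: [28, 28, 28, 28, 29],
    105: [21, 21, 21, 21, 21],
}

for n_proteins, sizes in five_fold_splits.items():
    count = partition_count(sizes)
    # len(str(count)) - 1 is the decimal exponent of the count.
    print(f"{n_proteins} proteins: about 10^{len(str(count)) - 1} possible divisions")
```

Even the smallest of these data sets gives a count far beyond what could be enumerated, which is the point made in the text about sub-sampling.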

Prediction was performed by selecting different parameters
for the scaled window [-$\xi_1$, +$\xi_2$]. Preliminary tests indicated
that for a given $\xi_1$ the optimal result for $A^{+}$ was obtained
when $\xi_2$ = 2. The predicted results by both self-consistency
and jackknife tests with different values of $\xi_1$ are given in
Table I, from which we can see that the overall success rate
$A$ is improved with increase in $\xi_1$. However, if $\xi_1$ is too large,
many short signal peptides will be excluded. For example,
two signal peptides are excluded when $\xi_1$ = 10, five when
$\xi_1$ = 12, six when $\xi_1$ = 13, eight when $\xi_1$ = 14, 13 when
$\xi_1$ = 15, 52 when $\xi_1$ = 16 and 186 when $\xi_1$ = 18. Each of
the excluded signal peptides was counted as an unsuccessful
prediction event, contributing to the reduction of the success
rate for the prediction of signal peptides. As a consequence,
$A^{+}$ was gradually reduced when $\xi_1 \geq 16$ (Table I). As a
compromise, we select $\xi_1$ = 13, 14 or 15 and $\xi_2$ = 2 as the
[...] parameters for the scaled window [...]. When
$\xi_1$ and $\xi_2$ are within these values, the success rates $A^{+}$
[...] for signal peptides are over 93 and 89%
by self-consistency and jackknife tests, respectively;
the corresponding success rates for the non-signal
peptide set are both over 92%. Also, the overall success rates
(Equation 7) for the cleavage site location by both self-consistency
and jackknife tests are over 92%.

Besides the neural network (NN) method proposed by
Nielsen et al. (1997), there are some other methods, such as
the simple weight matrix method (von Heijne, 1986), the
hidden Markov method (Baldi and Brunak, 1998) and the
physical sequence analysis method (Ladunga, 1999). Like
Nielsen et al.'s method, all these methods have played an
important role in stimulating the development of this area.
The simple weight matrix method is one of the earliest practical
[...] cleavage sites.
However, as pointed out by Nielsen et al. (1997), if "the
original weight matrix (von Heijne, 1986) is applied
to" the current data set, "the performance is much lower" in
comparison with [...]. The hidden Markov method
(HMM) also belongs to the machine learning approach; the
term 'hidden' refers to the invisibility of the underlying random
walk between different states. Actually, the HMM method is
a different type of [...] method and hence
[...]. The
physical sequence analysis method, also called PHYSEAN
method, was established on the basis of the physical, chemical
and biological characteristics of protein sequences. The
working data sets for PHYSEAN consist of 2532 preproteins
with signal peptides and [...] cytosolic proteins. As described
by Ladunga (1999), three-quarters of the sequences in the data
sets were randomly selected to form a training set and the
predictions were performed on the remaining one-quarter of
sequences [...]
proteins by five repetitions of [...] validation experiments,
the success rate of [...] prediction of cleavage
sites was 79.28%. It was not possible to make a direct
comparison of the present algorithm with PHYSEAN based
on the same data set because, unlike NN (Nielsen et al., 1997),
the data sets in PHYSEAN are not accessible to the public.
Moreover, as we can see, the cross-validation procedure in
PHYSEAN is also a sub-sampling test and hence could not
avoid the problem of arbitrariness either. This can be illustrated
as follows. Even only for the 2532 preproteins, the number of
possible sub-sampling combinations would be
2532!/(633!1899!) $\gg 10^{370}$. Compared with such a huge number,
five different sub-samplings, although randomly selected, are
merely a very tiny fraction of the possible combinations (i.e.,
the fraction of sub-samplings considered is $\ll 0.5 \times 10^{-369}$).

Accordingly, given both the higher success rate and the more
rational test procedures, it is worth communicating the
new algorithm to those working in the area concerned. At least
it will play a complementary role to the existing algorithms,
stimulating the development of protein signal peptide prediction.
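
The size of this number can be checked directly. The sketch below, under the assumption that one sub-sampling corresponds to choosing 633 held-out sequences from the 2532 preproteins, evaluates the binomial coefficient exactly and compares the share covered by five sub-samplings with the bound quoted above. The variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

N_PREPROTEINS = 2532  # preproteins in the PHYSEAN working set
N_TEST = 633          # one quarter held out for prediction

# Exact number of ways to choose the held-out quarter: 2532! / (633! 1899!).
n_combinations = comb(N_PREPROTEINS, N_TEST)

# Share of all possible sub-samplings covered by five repetitions.
fraction_sampled = Fraction(5, n_combinations)

print("decimal digits in the number of combinations:", len(str(n_combinations)))
print("exceeds 10^370:", n_combinations > 10**370)
print("fraction below 0.5 x 10^-369:", fraction_sampled < Fraction(5, 10**370))
```

Exact integer and rational arithmetic is used here because the quantities are far outside the range of floating-point numbers.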
Conclusion

Since the present model has explicitly incorporated the coupling
among the subsites -3, -1 and +1 and all these subsites are
very close to the cleavage site, it can be directly used
to [...]
signal peptidase. The present model can also serve as a useful
vehicle for helping further investigate many unclear details
[...] mechanism of the ZIP code protein-sorting
system in cells. Furthermore, since signal peptides are
the key in determining the subcellular location of proteins, the
{-3, -1, +1} model might have some impact in improving
the prediction quality of protein subcellular location ([...]
et al., 1997; Reinhardt and Hubbard, 1998; Chou and Elrod,
[...]).
SHORT COMMUNICATION

Identification of prokaryotic and eukaryotic signal peptides and
prediction of their cleavage sites

Henrik Nielsen, Jacob Engelbrecht 1, Søren Brunak and
Gunnar von Heijne 3

Center for Biological Sequence Analysis, [...]
[...]
S-106 91 Stockholm, Sweden

1 Present address: Novo Nordisk A/S, Scientific Computing, Building 9M1, [...]

We have developed a new method for the identification of
signal peptides and their cleavage sites based on neural
networks trained on separate sets of prokaryotic and
eukaryotic sequence. The method performs significantly
better than previous prediction schemes and can easily be
applied on genome-wide data sets. Discrimination between
cleaved signal peptides and uncleaved N-terminal signal-anchor
sequences is also possible, though with lower precision.
Predictions can be made on a publicly available
WWW server.

Keywords: cleavage sites/protein sorting/secretion/signal peptide



Introduction 

Signal peptides control the entry of virtually all proteins to
the secretory pathway, both in eukaryotes and prokaryotes
(Gierasch, 1989; von Heijne, 1990; Rapoport, 1992). They
comprise the N-terminal part of the amino acid chain and are
cleaved off while the protein is translocated through the
membrane. The common structure of signal peptides from
various proteins is commonly described as a positively charged
n-region, followed by a hydrophobic h-region and a neutral
but polar c-region. The (-3, -1) rule states that the residues at
positions -3 and -1 (relative to the cleavage site) must be
small and neutral for cleavage to occur correctly (von Heijne,
1983, 1985).

A strong interest in the automated identification of signal
peptides and the prediction of their cleavage sites has been
evoked not only by the huge amount of unprocessed data
[...] to find more effective
vehicles for the production of proteins in recombinant systems.
The most [...] method for [...] of
the cleavage site is a weight matrix which was published in
1986 (von Heijne, 1986). This method [...]
discriminate between signal peptides and non-signal peptides
by using the [...]. The [...]
matrices are commonly used today, even though the amount
of signal peptide data available has increased since 1986 by a
factor of 5-10.

Here, we present a combined neural network approach to
the recognition of signal peptides and their cleavage sites
[...]
peptides. A similar combination of two pairs of networks has
been used with success to predict the intron splice sites
in pre-mRNA from humans and the dicotyledonous plant
[...] (Brunak et al., 1991; Hebsgaard et al., [...]).

Artificial neural networks have been used for many biological
sequence analysis problems (Hirst and Sternberg, 1992;
Presnell and Cohen, 1993). They have also been applied to
[...] predicting signal peptides and [...]
cleavage sites, but until now without leading to practically
applicable prediction methods with [...] improvement
in performance compared with the weight matrix method
(Arrigo et al., 1991; Ladunga et al., 1991; Schneider and
Wrede, 1993).

Materials and methods 

The data were taken from SWISS-PROT version 29 (Bairoch
and Boeckmann, 1994). The data sets were divided into
prokaryotic and eukaryotic entries and the prokaryotic data sets
were further divided into Gram-positive eubacteria (Firmicutes)
and Gram-negative eubacteria (Gracilicutes), excluding
Mycoplasma and Archaebacteria. Viral, phage and organellar
proteins were not included. In addition, two single-species
data sets were selected, a human subset of the eukaryotic data
and an E.coli subset of the Gram-negative data.

The sequence of the signal peptide and the first 30 amino
acids of the mature protein from each secretory protein were
included in the data set. The first 70 amino acids of each
sequence were used from the cytoplasmic and (for the eukaryotes)
nuclear proteins. In addition, a set of eukaryotic signal
anchor sequences, i.e. N-terminal parts of type II membrane
proteins (von Heijne, 1988), were extracted (see Figure 1).

As an example of a large-scale application of the
method, we used the Haemophilus influenzae Rd genome,
the first genome of a free-living organism to be completed
(Fleischmann et al., 1995). We have downloaded the sequences
[...]
from the WWW server of The Institute for
Genomic Research at http://www.tigr.org/. Only the first 60
positions of each sequence were analysed.

We have attempted to avoid signal peptides where the
[...], but we are
not able to eliminate them completely, since many database
entries simply lack information about the quality of the
[...].

Redundancy in the data sets was avoided by excluding pairs
of sequences which were functionally homologous, i.e. those
that had more than [...]
matches in a local alignment (Nielsen et al., 1996a). Redundant
sequences were removed using an algorithm which guarantees
that no pairs of homologous sequences remain in the data set
(Hobohm et al., 1992). This procedure removed 13-56% of
the sequences. The numbers of non-homologous sequences
remaining in the data sets are shown in Table I. Redundancy



© Oxford University Press

Exhibit C(5; 






Table I. Data and performance values

[Columns: Source; Data (number of sequences); Network architecture; Performance. Table entries not recoverable.]

The groups are eukaryotes, human, Gram-negative bacteria ('Gram-'), E.coli and Gram-positive bacteria ('Gram+'). [...]

The values given in parentheses indicate the performance for the human sequences when using networks trained on all eukaryotic data and [...].

reduction was not applied to the signal anchor data or the
H.influenzae data, since these were not used as training data.
■ ,< i . < \ » 
The signal peptide problem was posed to the neural networks
in two ways: (i) recognition of the cleavage sites against the
[...] and (ii) [...]
of amino acids [...]. In the
latter case, negative examples included both the first 70 positions
of non-secretory proteins and the first 30 positions of the mature
part of secretory proteins.

The neural networks were feed-forward networks with zero
or one layer of two to 10 hidden units, trained using back-propagation
(Rumelhart et al., 1986) with a slightly modified
[...]. The sequence data were presented to the network
using sparsely encoded [...] (Qian and Sejnowski,
1988; Brunak et al., 1991). Symmetric and asymmetric windows
of a size varying from five to 79 positions were tested.
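
Sparse encoding is commonly understood as representing each residue in the window by a vector with a single non-zero entry. The sketch below illustrates this under the assumptions of a 20-letter alphabet plus one extra unit for positions outside the sequence; the window sizes, the padding symbol and all names are chosen for illustration and are not taken from the paper.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
PAD = "-"                          # assumed symbol for positions off the sequence ends
SYMBOLS = ALPHABET + PAD


def sparse_encode_window(sequence, centre, left, right):
    """One-hot encode the residues from centre-left to centre+right (inclusive).

    Returns a flat list of 0/1 values of length (left + right + 1) * 21.
    An asymmetric window is obtained by choosing left != right.
    """
    encoded = []
    for pos in range(centre - left, centre + right + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else PAD
        if residue not in SYMBOLS:
            residue = PAD  # unknown letters are treated like padding in this sketch
        unit = [0] * len(SYMBOLS)
        unit[SYMBOLS.index(residue)] = 1
        encoded.extend(unit)
    return encoded


vector = sparse_encode_window("MKLLVA", centre=1, left=2, right=3)
print(len(vector), sum(vector))
```

Each window position contributes exactly one active input unit, so the number of ones always equals the window length.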

Based on the numbers of correctly and incorrectly predicted
positive and negative examples, we calculated the correlation
coefficient (Mathews, 1975). The correlation coefficients of both
the training and test sets were monitored during training and the
performance of the training cycle with the maximal test set
correlation was recorded for each training run. The networks
chosen for inclusion in the WWW server have been trained until
[...] correlation coefficients. We did not pick the architecture with
absolutely the best performance, but instead the smallest network
[...]
window or adding [...] hidden units.
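
For reference, the correlation coefficient for a two-class prediction is usually computed from the four counts of true and false positives and negatives. The sketch below uses the standard formula; the function name and the convention of returning zero for an empty row or column are assumptions of this example, and the counts are invented.

```python
from math import sqrt


def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient from the four prediction counts.

    tp, tn: correctly predicted positive and negative examples
    fp, fn: incorrectly predicted positive and negative examples
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention assumed here when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# Invented counts, for illustration only.
print(round(correlation_coefficient(tp=90, tn=95, fp=5, fn=10), 3))
```

The value ranges from -1 (every example misclassified) through 0 (no better than chance) to +1 (perfect prediction), which makes it suitable for monitoring training on unbalanced data.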



The trained networks provide two different scores between
zero and one for each position in an amino acid sequence. The
output from the signal peptide/non-signal peptide networks, the
S-score, can be interpreted as an estimate of the probability of
[...]
[...] the C-score [...]
the first position in the mature protein (position +1 relative to the
cleavage site).

[...] the
true cleavage site may often be found by inspecting the S-score
curve in order to see which of the C-score peaks coincides best
with the transition from the signal peptide to the non-signal
peptide region. In order to formalize this and improve the prediction,
we have tried a number of linear and non-linear combinations
of the raw network scores and evaluated the percentage of
sequences with correctly placed cleavage sites in the five test
sets. The best measure was the geometric average of the C-score
and a smoothed derivative of the S-score, termed the Y-score:

$$
Y_i = \sqrt{C_i \, \Delta_d S_i} \qquad (1)
$$

where $\Delta_d S_i$ is the difference between the average S-score of $d$
positions before and $d$ positions after position $i$: [...]
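
The combination described above can be sketched as follows. The exact windowing is an assumption of this example: the $d$ positions "before" are taken as $i-d, \ldots, i-1$, the $d$ positions "after" as $i, \ldots, i+d-1$, windows are truncated at the sequence ends, and negative differences are set to zero before the square root. The function name and the toy scores are hypothetical.

```python
from math import sqrt


def y_scores(c_scores, s_scores, d):
    """Combine per-position scores as Y_i = sqrt(C_i * delta_i).

    delta_i is the mean S-score of up to d positions before i minus the
    mean S-score of up to d positions starting at i (assumed convention).
    """
    n = len(s_scores)
    result = []
    for i in range(n):
        before = s_scores[max(0, i - d):i]
        after = s_scores[i:min(n, i + d)]
        if not before or not after:
            result.append(0.0)  # no S-score drop can be measured here
            continue
        delta = sum(before) / len(before) - sum(after) / len(after)
        result.append(sqrt(c_scores[i] * max(delta, 0.0)))
    return result


# Toy example: the S-score drops from 0.9 to 0.1 where the C-score peaks.
s = [0.9] * 5 + [0.1] * 5
c = [0.0] * 10
c[5] = 0.8
y = y_scores(c, s, d=3)
print(max(range(len(y)), key=y.__getitem__))
```

A high Y-score therefore requires both a C-score peak and a falling S-score at the same position, which is how an isolated C-score peak far from the S-score transition is suppressed.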

In Figure 2(A), examples of the values of the C-, S- and Y-scores
are shown for a typical signal peptide with a typical
cleavage site. [...]
to an abrupt change in the S-score from a high to low value.
Among the real examples, the C-score [...] peak
and the S-score [...]
[...] correspond
to the maximal Y-score (combined score).

For typical non-secretory proteins, the values of the C-, S-
and Y-scores are lower, as shown in Figure 2(B). We found the
best discriminator between signal peptides and non-secretory
proteins [...]
peptide region, [...] from position 1 to [...]
before the position where the Y-score has a maximal value. If
this value, the mean S-score, is greater than 0.5, we predict
the sequence in question to be a signal peptide (cf. Figure 3).
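
The decision rule just described can be sketched in a few lines. This example assumes 0-based positions, takes the index of the maximal Y-score as the predicted first position of the mature protein, and averages the S-score over all positions before it; the function name and the toy scores are hypothetical.

```python
def predict_signal_peptide(s_scores, y_scores, cutoff=0.5):
    """Classify a sequence from its per-position S- and Y-scores.

    Returns (is_signal_peptide, cleavage_index, mean_s), where mean_s is
    the mean S-score from the first position up to, but not including,
    the position with the maximal Y-score.
    """
    cleavage_index = max(range(len(y_scores)), key=y_scores.__getitem__)
    region = s_scores[:cleavage_index]
    mean_s = sum(region) / len(region) if region else 0.0
    return mean_s > cutoff, cleavage_index, mean_s


# Toy scores, for illustration only.
s = [0.9, 0.8, 0.9, 0.2, 0.1]
y = [0.0, 0.1, 0.2, 0.7, 0.1]
print(predict_signal_peptide(s, y))
```

The cut-off is left as a parameter because the text later applies a different value (0.62) when separating signal peptides from signal anchors.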

[...]
in detail elsewhere (Nielsen et al., 1997).
Results and discussion

The optimal network architecture and corresponding predictive
performance for all the data sets are shown in Table I. The
C-score problem is best solved by networks with asymmetric
windows [...].
This corresponds well with
the location of the cleavage site pattern information, which is
shown as sequence logos (Schneider and Stephens, 1990) in
Figure 1. The S-score problem, on the other hand, is best
solved by [...].

Although our method is able to locate cleavage sites and
discriminate signal peptides from non-secretory proteins with
a reasonably high reliability, the accuracy of the cleavage site
location is lower than that reported for the original weight
matrix method (von Heijne, 1986): 78% for eukaryotes and
89% for prokaryotes (not divided into Gram-positive and
Gram-negative). When the original weight matrix is applied to our
recent data set, however, the performance is much lower. This
suggests a larger variation in the examples of the signal
peptides found since then. It may, of course, also reflect a
higher occurrence of errors in our automatically selected data
than in the manually selected 1986 set.

Figure 2. [...] i.e. the first
position in the mature protein. (A) A successfully predicted signal peptide.
The true cleavage site is marked with an arrow. (B) A non-secretory protein.
For many non-secretory proteins, all three scores are very low throughout
the sequence. In this example, there are peaks of the C- and S-scores, but
the sequence is still easily classifiable as non-secretory, since the C-score
peak occurs far away from the S-score decline and the region of the high
S-score is far too short.

In order to compare the strength of the neural network
approach to the weight matrix method, we recalculated new
weight matrices from our [...] data and tested [...]
of these (results not shown). The weight matrix method was
comparable to the neural networks when calculating the C-score,
[...] able to solve the S-score problem
and [...] the
combined Y-score.

Figure 4. [...] cleavable signal peptide, the predicted number of secretory
proteins in H.influenzae [...] is 330 out of 1680 [...].

Note that the prediction performances reported here correspond
to minimal values. The test sets in the cross-validation
have a very low sequence similarity; in fact, the sequence
similarity is so low that the correct cleavage sites cannot be
found by alignment (Nielsen et al., 1996a). This means that
the prediction accuracy on sequences with some similarity to
the sequences in the data sets will in general be higher.

The differences between the signal peptides of different
organisms are apparent from Figure 1. The signal peptides
from Gram-positive bacteria are considerably longer than those
of other organisms, with much more extended h-regions, as
observed previously (von Heijne and Abrahmsén, 1989). The
[...]
are dominated by Leu with some occurrence of Val (V), Ala,
Phe (F) and Ile (I). Close to the cleavage site, the
(-3, -1) rule is clearly visible for all three data sets, but
while a number of different amino acids are accepted in the
eukaryotes, [...] exclusively
in these two positions. In the first few positions of the mature
protein (downstream of the cleavage site) the prokaryotes
show certain preferences [...]
amino acids, and [...]
pattern can be seen for [...]
visible in the figure [...]
with the hypothesis that [...]
the n-region where the [...]
prokaryotes, but not [...]
N-terminal Met in its [...]
(von Heijne, 1985).

The difference in structure [...]
of the trained neural networks (Table I). Gram-negative
cleavage sites have the strongest pattern, i.e. the highest
information content, and, consequently, they are the easiest
to predict, both at the single-position and at the sequence level.
The eukaryotic cleavage sites are significantly more difficult
[...] protein [...]
expected from the logos (Figure 1), since they
[...] already as high an information content as the
Gram-[...] cleavage sites, but [...] the longer Gram-positive signal
[...] the cleavage sites have to be located
[...] a larger background of non-cleavage site positions.
Discrimination of signal peptides versus non-secretory
[...], on the other hand, is [...]
for the prokaryotes. This may be due to the more characteristic
[...] of the eukaryotic signal peptides.
The logos for [...] human and [...]
since [...].
Accordingly,
the predictive [...] was not improved by training
[...]. On the contrary, the
E.coli signal peptides are predicted even better by the
Gram-negative networks (probably due
to the relatively small size of the E.coli data set). In other
words, we have found no evidence for species-specific features
of the signal peptides of humans and E.coli.

Signal anchors often have sites similar to signal peptide
cleavage sites after their hydrophobic (transmembrane) region.
Therefore, a prediction method can easily be expected to
mistake signal anchors for signal peptides. [...]
of the mean S-score for the [...] signal anchors is
included. It shows some overlap with the signal peptide
distribution. If the standard cut-off of 0.5 is applied to the
[...]

[...] are applied together, leaving only 'typical' signal peptides
[...].

Some of the sequences predicted to be signal peptides
according to the S-score but not according to the Y-score may
be signal anchor-like sequences of type II (single-spanning)
[...]
is strengthened [...] hydrophobicity analysis of the ambiguous
[...]
cut-off [...] for the discrimination of signal anchors
versus signal peptides in eukaryotes (0.62) to the mean S-score,
the estimate is lowered from 20 to 15%.

[...]
signal peptides according to the maximal Y-score but not the
mean S-score [...] the effect [...] initiation codon [...]
predicted coding region having been placed too far upstream.
In this case, the apparent signal peptide becomes too long and
the region between the false and the true initiation codon will
probably not have signal peptide character, thereby bringing
[...]

[...] (the corresponding figure for the human signal anchors is 75% when
using human networks and 68% when using eukaryotic networks).
With a cut-off optimized for signal anchor versus
signal peptide discrimination (0.62) we were able to lower
this error rate to 45% for the eukaryotic data set. The mean
S-score [...] the maximal C- or
[...] indicates that the pseudo-cleavage sites are in
fact [...].

However, the pseudo-cleavage [...]
the N-terminus [...]
accept signal peptides longer than [...]
the percentage of false positives among the signal anchors
[...] (using eukaryotic networks). When taking
[...]
discrimination between signal peptides and signal anchors.
This has not been [...] by any of the earlier
methods for signal peptide recognition.

So » ' ! > < > ' a ' 

We have applied the prediction method with networks trained
on [...]
genome. The [...]
1 to the position with maximal Y-score) is shown in Figure 4.

When applying the optimal cut-off value found for the
[...]
the number of sequences [...] signal peptides in
[...] out of 1680 sequences, [...]
20%. If the maximal S-score is used instead of the mean S-score,
the estimate comes out as 28% and with the maximal
Y-score it is [...] (data not shown). If all three criteria
[...] contain more methionines.

In conclusion, we estimate that 15-20% of the H.influenzae
proteins [...]. However, [...]
this would be [...] other analyses,
notably transmembrane segment predictions [...] initiation site
predictions.

Method and data publicly available 

The finished prediction method is available both via an e-mail
[...] submit their own amino
acid sequences in order to predict whether the sequence is a
signal peptide and, if so, where it will be cleaved. We
[...] that only the N-terminal part (say 50-70 amino
acids) [...] the interpretation
of the output is not obscured by false positives further
downstream in the protein.

The user is asked to choose between the network ensembles
trained on data from Gram-positive, Gram-negative or eukaryotic
organisms. We did not include the networks trained on
[...] in the servers, since they did not
improve the performance.

The values of the C-, S- and Y-scores are returned for every
position in the submitted sequence. In addition, the maximal
Y-score, maximal S-score and mean S-score values are given
for the entire sequence and compared with the appropriate cut-off.
If the sequence is predicted to be a signal peptide, the
position with the maximal Y-score is mentioned as the most
[...] in postscript format
[...].

We strongly recommend [...]
for the interpretation of the [...]
about, for example, multiple [...]
assigned initiation, which [...]
the maximal or mean score values.

The address of the mail server is [...]. For
detailed instructions, send a mail containing the word 'help'
only. The WWW server is accessible via the Center for
Biological Sequence Analysis homepage at
http://www.cbs.dtu.dk/.

All the data sets mentioned in Table I are available from an
FTP server at ftp://virus.cbs.dtu.dk/pub/signalp. Retrieve the
file README for detailed descriptions of the data and the format.
The FTP server and the mail server can both be accessed
directly from the WWW server.



References 

ArrigoJ'.. Guilian<va. s it i Rapaiio.A. ana Daniiam.G. (1991) io OS, 

, J * > U < i 22 ^ m 

iivunak.S.. InnaslbreGiU. ana iamxlsea.S. (1991;/ *fo/ a';.-.'.. 220. 49-63. 
II tun . i " , ' ' 2< fl - i4 

i s 1 i 1 

ti r I / ' > s 1 
409-417. 

Ladanaa,!.. Gzaka,!',, Gsabai.i. and 1 G99t) Ola'OiT. 7. ' 

ti iiiu I 405,442-4 

tit It i ,i i < F J 6a) /to 24 

1\ G ii 1 1 1 It i t i Nil 

M i » » n < a ■ 1 . 'a i ' liophs Siomo! Stnta., 22, 
281-298.. 

' t . S r o/ S65-884. 

1 f 

inn 'hi 1' mi at li > l 'n I 

McCiciianiU. n t t i ., l 3Va ::!:e! D^nbuted 

Foundations. Mi'f Prca,. Gaadaidac. MA. sp. 31S-363. 
Sctirieider.G. and Wreda.P. ( ;99a). /". M./. r.vo! . 56, 5x6-393. 
Sclmeidcr.T,D. and Siephens.R.M. 1 A»afca ,n /* /6a. , IS, 6097-6100, 
\t i t ,] . , - 
M,, Ikun. (. tt6 ' ! 1 

vor. Hdjne.G 11985)../ Ma'. ,5/,a/„ i 84, vv-135. 
n nil I ^ i t U 1 

I v r 

it i i ;o 

t i t il i i 1 44 Ja 

/?,.(,' /v, ;?9rt ; , , s , , , 3, '996; i ; u o < » i > 

72, 






